Feb 10, 2022 7:39 AM


The DOST dataset preserves scene text as it was observed in the real environment. The dataset contains videos (sequential images) captured in shopping streets in downtown Osaka with an omnidirectional camera. Using an omnidirectional camera helps exclude the user's intention from image capture. The sequential images encourage the development of new text detection and recognition techniques that exploit temporal information. Another important feature of the DOST dataset is that it contains non-Latin text: because the images were captured in Japan, they contain a lot of Japanese text as well as an adequate amount of Latin text. Because of these features, we can say that the DOST dataset preserves scene text in the wild.

Data Format

Text Localisation in Video

The objective of this task is to obtain the location of words in the video in terms of their affine bounding boxes. The task requires that words are both localised correctly in every frame and tracked correctly over the video sequence. All the videos will be provided as MP4 files. Ground truth will be provided as a single XML file per video. The format of the ground truth file will follow the structure of the example below.

<?xml version="1.0" encoding="us-ascii"?>
<frames>
  <frame ID="1">
    <object Transcription="T" ID="1001" Quality="low" Language="Spanish" Mirrored="unmirrored">
      <Point x="97" y="382" />
      <Point x="126" y="382" />
      <Point x="125" y="410" />
      <Point x="97" y="411" />
    </object>
    <object Transcription="910" ID="1002" Quality="moderate" Language="Spanish" Mirrored="unmirrored">
      <Point x="607" y="305" />
      <Point x="640" y="305" />
      <Point x="639" y="323" />
      <Point x="609" y="322" />
    </object>
  </frame>
  <!-- Represents an empty frame -->
  <frame ID="2" />
  <frame ID="3">
    <object Transcription="T" ID="1001" Quality="moderate" Language="Spanish" Mirrored="unmirrored">
      <Point x="98" y="384" />
      <Point x="127" y="384" />
      <Point x="125" y="412" />
      <Point x="97" y="413" />
    </object>
    <object Transcription="910" ID="1002" Quality="high" Language="Spanish" Mirrored="unmirrored">
      <Point x="609" y="307" />
      <Point x="642" y="307" />
      <Point x="641" y="325" />
      <Point x="611" y="324" />
    </object>
  </frame>
</frames>

where <frames> is the root tag. <frame ID="*num_frame*"> identifies the frame inside the video. ID is the index of the frame in the video.

`<object Transcription="transcription" ID="num_id" Language="language" Mirrored="mirrored/unmirrored" Quality="low/moderate/high">`

represents each of the objects (words) in the frame.

  • Transcription is the textual transcription of the word
  • ID is a unique identifier of an object; all occurrences of the same object have the same ID.
  • Language defines the language the word is written in
  • Mirrored is a boolean value that defines whether the word is seen through a mirrored surface or not
  • Quality is the quality of the text, which takes one of the values low, moderate or high. The low value is special, as it marks text areas that are unreadable. During the evaluation, such areas are not taken into account: a method is not penalised if it does not detect these words, and a method that detects them does not get a better score.

<Point x="000" y="000" /> represents a point of the word bounding box in the image. Bounding boxes always comprise 4 points. See the ground truthing protocol for more information.

If no objects exist in a particular frame, the frame tag is created empty.

Participants are required to automatically localise the words in the images and return affine bounding boxes in the same XML format. In the XML files submitted by participants, only the ID attribute is expected for each object; any other attributes will be ignored.

A single compressed (zip or rar) file should be submitted containing the result files for all the videos of the test set. If your method fails to produce any results for a particular video, do not include an XML file for that video.

This database consists of historical handwritten marriage records from the Archives of the Cathedral of Barcelona. The pages we used correspond to volume 69, written in old Catalan by a single writer in the 17th century. Each marriage record contains information about the husband's occupation, place of origin, the husband's and wife's former marital status, parents' occupation, place of residence, geographical origin, etc.
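The video ground truth XML format described above can be read with Python's standard library. The following is a minimal sketch, not an official loader: the function name `parse_ground_truth` and the returned dictionary layout are illustrative assumptions, and the tag names (`frame`, `object`, `Point`) are taken from the example above, so they may need adjusting if the distributed files use different capitalisation.

```python
import xml.etree.ElementTree as ET


def parse_ground_truth(xml_text):
    """Return {frame_id: [object dict, ...]} from a ground truth XML string.

    Hypothetical helper; the dictionary keys are chosen for illustration.
    """
    root = ET.fromstring(xml_text)
    frames = {}
    for frame in root.iter("frame"):
        objects = []
        for obj in frame.findall("object"):
            # Each bounding box is given as four <Point x=".." y=".."/> children.
            points = [(int(p.get("x")), int(p.get("y"))) for p in obj.findall("Point")]
            objects.append({
                "id": obj.get("ID"),
                "transcription": obj.get("Transcription"),
                "quality": obj.get("Quality"),
                "points": points,
            })
        # An empty frame simply maps to an empty list.
        frames[int(frame.get("ID"))] = objects
    return frames


SAMPLE = """<frames>
  <frame ID="1">
    <object Transcription="T" ID="1001" Quality="low" Language="Spanish" Mirrored="unmirrored">
      <Point x="97" y="382" />
      <Point x="126" y="382" />
      <Point x="125" y="410" />
      <Point x="97" y="411" />
    </object>
  </frame>
  <frame ID="2" />
</frames>"""

frames = parse_ground_truth(SAMPLE)
```

Objects whose `quality` is `low` could then be filtered out before scoring, in line with the evaluation rule described above.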

Still Image Mode

For the text localization task we will provide bounding boxes of words for each of the images. The ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word's bounding box and its transcription in a comma separated format (see Figure 1).

Figure 1: Ch4_Task1_Figure1.png

For the text localization task the ground truth data is provided in terms of word bounding boxes. Unlike Challenges 1 and 2, bounding boxes are NOT axis oriented in Challenge 4, and they are specified by the coordinates of their four corners in clockwise order. For each image in the training set a separate UTF-8 text file will be provided, following the naming convention:

gt_[image name].txt

The text files are comma separated files, where each line corresponds to one word in the image and gives its bounding box coordinates (four corners, clockwise) and its transcription in the format:

x1, y1, x2, y2, x3, y3, x4, y4, transcription

Please note that anything that follows the eighth comma is part of the transcription, and no escape characters are used. "Do Not Care" regions are indicated in the ground truth with a transcription of "###".
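Because the transcription may itself contain commas, a line has to be split on the first eight commas only. The sketch below illustrates this; the function name `parse_gt_line` and the returned tuple are assumptions made for the example, and stripping a leading byte order mark is a precaution rather than something the format description requires.

```python
def parse_gt_line(line):
    """Split one ground truth line into (corners, transcription, dont_care).

    Hypothetical helper: everything after the eighth comma is kept verbatim
    as the transcription, since no escape characters are used.
    """
    line = line.lstrip("\ufeff").rstrip("\r\n")
    parts = line.split(",", 8)  # at most 8 splits -> 9 fields
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a transcription: {line!r}")
    coords = [int(v) for v in parts[:8]]  # int() tolerates surrounding spaces
    corners = list(zip(coords[0::2], coords[1::2]))  # [(x1, y1), ..., (x4, y4)]
    transcription = parts[8]
    dont_care = transcription == "###"
    return corners, transcription, dont_care


corners, text, dont_care = parse_gt_line("377,117,463,117,465,130,378,130,Hello, world")
```

Regions flagged as `dont_care` can be skipped when matching detections against the ground truth.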

Cropped word recognition


For the word recognition task, we provide all the words in our dataset with 3 characters or more in separate image files, along with the corresponding ground-truth transcription (see Figure 2 for examples). For each word, the axis-oriented area that tightly contains the word will be provided.

The transcription of all words is provided in a SINGLE UTF-8 text file for the whole collection. Each line in the ground truth file has the following format:

[word image name], "transcription"

An example is given in Figure 2. Please note that the escape character (\) is used for double quotes and backslashes.
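A line of this file can be decoded as sketched below. This is an illustrative reading of the format, assuming the escape character is a backslash, that image names contain no commas, and that the transcription is always wrapped in double quotes; the function name `parse_word_gt_line` is hypothetical.

```python
def parse_word_gt_line(line):
    """Return (image_name, transcription) from a line like: name, "text".

    Assumes a backslash escapes double quotes and backslashes inside the
    quoted transcription.
    """
    name, sep, quoted = line.rstrip("\r\n").partition(",")
    quoted = quoted.strip()
    if not sep or len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError(f"malformed line: {line!r}")
    body = quoted[1:-1]
    chars = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            chars.append(body[i + 1])  # keep the escaped character itself
            i += 2
        else:
            chars.append(body[i])
            i += 1
    return name.strip(), "".join(chars)


name, text = parse_word_gt_line('word_1.png, "say \\"hi\\""')
```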

In addition, the relative coordinates of the (non-axis oriented) bounding box that defines the word within the cut-out word image will be provided in a separate SINGLE text file for the whole collection. Coordinates of the words are given in reference to the cut-out box, as the four corners of the bounding box in a clock-wise manner. Each line in the ground truth file has the following format:

[word image name], x1, y1, x2, y2, x3, y3, x4, y4

An example is given in figure 2.

For testing we will provide the images of about 2000 words and we will ask for the transcription of each image. A single transcription per image will be requested. The authors should return all transcriptions in a single text file of the same format as the ground truth.

For the evaluation we will calculate the edit distance between the submitted transcription and the ground truth transcription. Equal weights will be set for all edit operations. The best performing method will be the one with the smallest total edit distance.
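The metric described above corresponds to the standard Levenshtein distance with unit cost for insertion, deletion and substitution. The following sketch shows one way to compute it and to sum it over a collection. It is not the official evaluation script: any normalisation or case handling the organisers may apply is not specified here, and the function names and the choice to score a missing prediction as an empty string are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance with equal (unit) weights for all edit operations."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def total_edit_distance(predictions, ground_truth):
    """Sum the distance over all ground truth words.

    Both arguments map image name -> transcription; a missing prediction is
    treated as an empty string (an assumption of this sketch).
    """
    return sum(edit_distance(predictions.get(name, ""), text)
               for name, text in ground_truth.items())


score = total_edit_distance({"word_1.png": "kitten"},
                            {"word_1.png": "sitting", "word_2.png": "abc"})
```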


Video Mode

In the video mode, each place consists of the following files.

  1. Video: Consecutive images are provided as a single video file. File name: "video_x_y_z.mp4", where x is the place ID (1-4), y is the camera ID (0-4) and z is the serial number of the video within the camera. Example: video_1_0_0.mp4 contains the first sequence of consecutive frames of camera 0 of place 1.
  2. Ground truth: Ground truth is provided in an XML file for each place. The format is almost the same as that of Robust Reading Competition 2013/2015 Challenge 3, Text in Videos. The difference is that, instead of the "language" tag in RRC Challenge 3, DOST uses a "script" tag to distinguish Japanese text from other text. The script tag takes the value "Latin" or "Japanese".
  3. Video with GT: For reference, a video in which the GT texts are displayed is provided. File name: "video_x_y_z_GT.mp4"
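The naming convention above can be decoded with a regular expression. This is a small sketch under the stated convention only; the `VideoName` type and `parse_video_name` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class VideoName(NamedTuple):
    place: int     # x: place ID
    camera: int    # y: camera ID
    serial: int    # z: serial number
    with_gt: bool  # True for the "_GT" reference video


_VIDEO_PATTERN = re.compile(r"video_(\d+)_(\d+)_(\d+)(_GT)?\.mp4")


def parse_video_name(filename):
    """Parse 'video_x_y_z.mp4' or 'video_x_y_z_GT.mp4' into its components."""
    match = _VIDEO_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected video file name: {filename!r}")
    return VideoName(int(match[1]), int(match[2]), int(match[3]),
                     match[4] is not None)


info = parse_video_name("video_1_0_0.mp4")
```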

Still Image Mode

Localisation (Task I1) and End-to-end (Task I3) Tasks

Each place consists of the following files.

  1. Image
    Images sampled every 10 frames from the videos are provided. File name: "image_x_y_z.png", where x is the place ID (1-4), y is the camera ID (0-4) and z is the frame number. Note that z is not consecutive (e.g., 0, 10, 20 and so on). Example: image_1_0_10.png is frame 10 of camera 0 of place 1. A list of the image files contained in a specific sequence is also provided. Example of file name: "filelist_place4_cam_0_seq1.txt" for Place 4, Camera 0 and Sequence 1.
  2. Ground truth
    Ground truth is provided in a text file for each image. The format is the same as Robust Reading Competition 2015 Challenge 4, Incidental Scene Text.
  3. Image with GT
    For reference, an image in which the GT texts are displayed is provided. File name: "image_x_y_z_GT.png"

Cropped word recognition

Following the manner of Robust Reading Competition 2015 Challenge 4, Incidental Scene Text, the following files are provided.

  1. Image: Selected cropped images are provided separately for Latin and Japanese scripts. File name: "lang_x.png", where lang is "L" for Latin or "J" for Japanese and x is the ID of the word. Example: "L_10.png" is the 10th image of Latin script.
  2. Coordinates: Coordinates of the four points of the bounding box surrounding each word are given. The format is the same as Robust Reading Competition 2015 Challenge 4, Incidental Scene Text. File names are "coords_Latin.txt" for Latin and "coords_Japanese.txt" for Japanese.
  3. Ground truth: Ground truth is provided in a text file for each image. The format is the same as Robust Reading Competition 2015 Challenge 4, Incidental Scene Text. File names are "gt_Latin.txt" for Latin and "gt_Japanese.txt" for Japanese.

Japanese character images of various fonts

[New in March 2018] Level-2 kanji character images have been added.

We provide Japanese character images in various fonts to give researchers an opportunity to try recognizing Japanese scene text. You can download the images from the links in Table 5.

Since the scene images of the DOST dataset were taken in Osaka, Japan, the dataset inherently contains character images in Japanese script in addition to Latin script. Because appropriate Japanese character datasets for training a classifier are lacking, it is not easy for non-Japanese researchers to try to recognize Japanese characters. To solve this problem and encourage non-Japanese researchers, we prepared this archive of Japanese character images rendered by us in various fonts.

In the first release in mid-April 2017, we provided 53 fonts, of which 15 are from Mac OS X (El Capitan) and 38 are from Windows 8.1. In the second release in early May 2017, we provided 724 fonts included in MORISAWA PASSPORT Academic Release. Each font consists of 3250 characters selected from the list of Japanese characters including level-1 kanji (written in Japanese). In the third release in March 2018, we added level-2 kanji images (see the list written in Japanese) in addition to level-1 kanji.

The file name of each character image is given by its UTF-8 character code. For example, the character image file of "あ" is named E38182.png. We provide Python scripts that convert characters to their codes and vice versa:

  • convert Japanese characters into codes
  • convert codes into Japanese characters

In both programs, characters not contained in the list of Japanese characters, such as ASCII characters, are ignored.

[Example #1] From characters to codes
% echo DOST データベース | python3
D O S T E38387 E383BC E382BF E38399 E383BC E382B9

[Example #2] From codes to characters
% echo D O S T E38387 E383BC E382BF E38399 E383BC E382B9 | python3
DOST データベース
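The mapping between a character and its file name code is the upper-case hexadecimal form of the character's UTF-8 bytes. The sketch below reproduces that mapping with the standard library. It is not the scripts distributed with the dataset: the function names are hypothetical, and the handling of ASCII characters (passed through unchanged, whitespace dropped) is only inferred from the example output above.

```python
def char_to_code(ch):
    """Upper-case hex of the UTF-8 encoding, e.g. the stem of 'E38182.png'."""
    return ch.encode("utf-8").hex().upper()


def code_to_char(code):
    """Inverse of char_to_code."""
    return bytes.fromhex(code).decode("utf-8")


def text_to_codes(text):
    """Convert non-ASCII characters to codes; ASCII is passed through as-is."""
    tokens = []
    for ch in text:
        if ch.isspace():
            continue
        tokens.append(ch if ch.isascii() else char_to_code(ch))
    return " ".join(tokens)


def codes_to_text(line):
    """Decode multi-character tokens as codes; single characters are kept."""
    return "".join(tok if len(tok) == 1 else code_to_char(tok)
                   for tok in line.split())


codes = text_to_codes("DOST データベース")
```

Note that the spaces of the original text are not recoverable from the code sequence in this sketch, so `codes_to_text` returns the characters without the separating space.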
Basic Information
Application Scenarios: Not Available
Annotations: Not Available
Tasks: Not Available
License: CC BY 4.0
Updated on: 2022-02-10 07:39:21
Data Type: Not Available
Data Volume: 0
Annotation Amount: 0
File Size: 0.00B
Copyright Owner: Computer Vision Center (CVC)