The DOST dataset preserves scene text as it was observed in the real environment. The dataset contains videos (sequential images) captured in shopping streets in downtown Osaka with an omnidirectional camera. Using an omnidirectional camera removes the photographer's intention from the captured images. The sequential images in the dataset encourage the development of new text detection and recognition techniques that exploit temporal information. Another important feature of the DOST dataset is that it contains non-Latin text: since the images were captured in Japan, it contains a large amount of Japanese text as well as an adequate amount of Latin text. Because of these features, the DOST dataset can be said to preserve scene text in the wild.
The objective of this task is to obtain the location of words in the video in terms of their affine bounding boxes. The task requires that words are both localised correctly in every frame and tracked correctly over the video sequence. All the videos will be provided as MP4 files. Ground truth will be provided as a single XML file per video. The format of the ground truth file will follow the structure of the example below.
<?xml version="1.0" encoding="us-ascii"?>
<frames>
<frame ID="1">
<object Transcription="T" ID="1001" Quality="low" Language="Spanish" Mirrored="unmirrored">
<Point x="97" y="382" />
<Point x="126" y="382" />
<Point x="125" y="410" />
<Point x="97" y="411" />
</object>
<object Transcription="910" ID="1002" Quality="moderate" Language=Spanish" Mirrored="unmirrored">
<Point x="607" y="305" />
<Point x="640" y="305" />
<Point x="639" y="323" />
<Point x="609" y="322" />
</object>
</frame>
<frame ID="2">
<!-- Represents an empty frame -->
</frame>
<frame ID="3">
<object Transcription="T" ID="1001" Quality="moderate"Language="Spanish" Mirrored="unmirrored">
<Point x="98" y="384" />
<Point x="127" y="384" />
<Point x="125" y="412" />
<Point x="97" y="413" />
</object>
<object Transcription="910" ID="1002" Quality="high" Language="Spanish" Mirrored="unmirrored">
<Point x="609" y="307" />
<Point x="642" y="307" />
<Point x="641" y="325" />
<Point x="611" y="324" />
</object>
</frame>
</frames>
where
<frames> is the root tag.
<frame ID="num_frame"> identifies the frame inside the video. ID is the index of the frame in the video.
<object Transcription="transcription" ID="num_id" Language="language" Mirrored="mirrored/unmirrored" Quality="low/moderate/high"> represents each of the objects (words) in the frame.
Transcription is the textual transcription of the word.
ID is a unique identifier of an object; all occurrences of the same object have the same ID.
Language defines the language the word is written in.
Mirrored is a boolean value that defines whether the word is seen through a mirrored surface or not.
Quality is the quality of the text, which can be one of the values low, moderate or high. The low value is special, as it is used to define text areas that are unreadable. During the evaluation, such areas will not be taken into account: a method will not be penalised if it does not detect these words, while a method that detects them will not get any better score.
<Point x="000" y="000" /> represents a point of the word bounding box in the image. Bounding boxes always comprise 4 points. See more information about the ground truthing protocol here.
If no objects exist in a particular frame, the frame tag is left empty.
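For illustration, the following minimal Python sketch reads such a ground truth file into a dictionary keyed by frame ID. It follows the tag casing of the example above (lowercase frame and object, capitalised Point); the file name video_1_GT.xml is hypothetical.

import xml.etree.ElementTree as ET

def load_ground_truth(path):
    """Return {frame_id: [(object_id, transcription, quality, [(x, y), ...])]}."""
    frames = {}
    root = ET.parse(path).getroot()  # the <frames> root tag
    for frame in root.findall("frame"):
        words = []
        for obj in frame.findall("object"):
            points = [(int(p.get("x")), int(p.get("y"))) for p in obj.findall("Point")]
            words.append((int(obj.get("ID")), obj.get("Transcription"),
                          obj.get("Quality"), points))
        frames[int(frame.get("ID"))] = words  # an empty frame yields an empty list
    return frames

gt = load_ground_truth("video_1_GT.xml")  # hypothetical file name
print(len(gt), "frames loaded")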
Participants are required to automatically localise the words in the images and return affine bounding boxes in the same XML format. In the submitted XML files, only the ID attribute is expected for each object; any other attributes will be ignored.
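A complementary sketch for producing result files in this format is given below. The structure of the detections argument is our own assumption, and only the ID attribute is written since other attributes are ignored.

import xml.etree.ElementTree as ET

def write_results(detections, num_frames, path):
    """detections maps a frame index to a list of (object_id, [(x, y), ...]) tuples."""
    root = ET.Element("frames")
    for frame_id in range(1, num_frames + 1):
        frame = ET.SubElement(root, "frame", ID=str(frame_id))
        for object_id, points in detections.get(frame_id, []):
            obj = ET.SubElement(frame, "object", ID=str(object_id))
            for x, y in points:
                ET.SubElement(obj, "Point", x=str(x), y=str(y))
        # a frame without detections is simply left empty
    ET.ElementTree(root).write(path, encoding="us-ascii", xml_declaration=True)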
A single compressed (zip or rar) file should be submitted containing all the result files for all the videos of the test set. If your method fails to produce any results for a particular video, you should include no XML file for that particular video.
This database consists of historical handwritten marriage records from the Archives of the Cathedral of Barcelona. The pages we used correspond to volume 69, written in old Catalan by a single writer in the 17th century. Each marriage record contains information about the husband's occupation, place of origin, the husband's and wife's former marital status, the parents' occupations, place of residence, geographical origin, etc.
For the text localization task we will provide bounding boxes of words for each of the images. The ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word's bounding box and its transcription in a comma-separated format (see Figure 1).
For
the text localization task the ground truth data is provided in terms of
word bounding boxes. Unlike Challenges 1 and 2, bounding boxes are NOT axis
oriented in Challenge 4, and they are specified by the coordinates of their
four corners in a clock-wise manner. For each image in the training set a
separate UTF-8 text file will be provided, following the naming convention:
gt_[image name].txt
The text files are comma-separated files, where each line corresponds to one word in the image and gives its bounding box coordinates (four corners, clockwise) and its transcription in the format:
x1, y1, x2, y2, x3, y3, x4, y4, transcription
Please note that anything that follows the eighth comma is part of the transcription, and no escape characters are used. "Do Not Care" regions are indicated in the ground truth with a transcription of "###".
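As a sketch of how such a line can be parsed, splitting on at most eight commas keeps any commas inside the transcription intact (the sample line below is purely illustrative):

def parse_gt_line(line):
    parts = line.rstrip("\r\n").split(",", 8)
    coords = [int(v) for v in parts[:8]]            # x1, y1, ..., x4, y4 (clockwise)
    transcription = parts[8]
    dont_care = transcription == "###"              # "Do Not Care" region
    points = list(zip(coords[0::2], coords[1::2]))
    return points, transcription, dont_care

print(parse_gt_line("10,20,110,25,108,60,8,55,EXAMPLE"))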
For the word recognition task, we provide all the words in our dataset with 3 characters or more in separate image files, along with the corresponding ground-truth transcription (see Figure 2 for examples). For each word, the axis-oriented area that tightly contains the word will be provided.
The transcription of all words is provided in a SINGLE UTF-8 text file for the whole collection. Each line in the ground truth file has the following format:
[word image name], "transcription"
An example is given in Figure 2. Please note that the escape character (\) is used for double quotes and backslashes.
In addition, the relative coordinates of the (non-axis oriented) bounding box that defines the word within the cut-out word image will be provided in a separate SINGLE text file for the whole collection. Coordinates of the words are given in reference to the cut-out box, as the four corners of the bounding box in a clock-wise manner. Each line in the ground truth file has the following format:
[word image name], x1, y1, x2, y2, x3, y3, x4, y4
An example is given in figure 2.
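A minimal sketch of reading the transcription file is shown below. The regular expression and the unescaping step reflect the quoting convention described above; the sample line is purely illustrative.

import re

LINE_RE = re.compile(r'^(?P<name>[^,]+),\s*"(?P<text>(?:[^"\\]|\\.)*)"\s*$')

def parse_transcription_line(line):
    m = LINE_RE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError("unexpected line format: %r" % line)
    name = m.group("name").strip()
    text = re.sub(r"\\(.)", r"\1", m.group("text"))  # undo \" and \\ escapes
    return name, text

print(parse_transcription_line('word_1.png, "SALE"'))  # hypothetical word image name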
For testing we will provide the images of about 2000 words and we will ask for the transcription of each image. A single transcription per image will be requested. The authors should return all transcriptions in a single text file of the same format as the ground truth.
For the evaluation we will calculate the edit distance between the submitted transcription and the ground truth transcription. Equal weights will be set for all edit operations. The best performing method will be the one with the smallest total edit distance.
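To make the ranking criterion concrete, the sketch below computes the unit-cost Levenshtein distance and sums it over a collection of submitted transcriptions. The dictionary-based interface is our own assumption, not the official evaluation code.

def edit_distance(a, b):
    """Levenshtein distance with equal weights for all edit operations."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def total_edit_distance(submitted, ground_truth):
    """Both arguments map word image name -> transcription; lower is better."""
    return sum(edit_distance(submitted.get(name, ""), text)
               for name, text in ground_truth.items())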
In the video mode, each place consists of the following files.
Localisation (Task I1) and End-to-end (Task I3) Tasks: each place consists of the following files.
Following the manner of Robust Reading Competition 2015 Challenge 4 (Incidental Scene Text), the following files are provided.
[New in March 2018] Level-2 kanji character images have been added. We provide Japanese character images of various fonts to give researchers an opportunity to try to recognize Japanese scene text. You can download the images from the links in Table 5. Since the scene images of the DOST dataset were taken in Osaka, Japan, the dataset inherently contains character images of Japanese script in addition to Latin script. Because of the lack of appropriate Japanese character datasets that can be used for training a classifier, it is not easy for non-Japanese researchers to try to recognize Japanese characters. To solve this problem and encourage non-Japanese researchers, we prepared this archive containing Japanese character images of various fonts rendered by us.
In the first release in mid-April 2017, we provided 53 fonts, included in Japanese_chars1.zip. Among them, 15 fonts are from Mac OS X (El Capitan) and 38 fonts are from Windows 8.1. In the second release in early May 2017, we provided 724 fonts from the MORISAWA PASSPORT Academic Release, included in Japanese_chars2.zip through Japanese_chars9.zip. Each font consists of 3250 characters selected from the list of Japanese characters including level-1 kanji (written in Japanese). In the third release in March 2018, we added level-2 kanji images (see the list, written in Japanese) in addition to level-1 kanji.
The file name of each character image is given by its UTF-8 character code. For example, the character image file of "あ" is named E38182.png. We provide Python scripts, included in Japanese_chars_tools.zip, that can convert characters to their codes and vice versa.
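For reference, a minimal Python sketch of the naming convention is given below (the official conversion scripts in Japanese_chars_tools.zip remain the reference implementation); it assumes Python 3.9+ for removesuffix.

def char_to_filename(ch):
    """'あ' -> 'E38182.png' (upper-case hex of the UTF-8 bytes)."""
    return ch.encode("utf-8").hex().upper() + ".png"

def filename_to_char(name):
    """'E38182.png' -> 'あ'."""
    return bytes.fromhex(name.removesuffix(".png")).decode("utf-8")

print(char_to_filename("あ"))           # E38182.png
print(filename_to_char("E38182.png"))   # あ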