Overview
This is the example and evaluation dataset used for the SmartDoc 2017 Competition on
Recognition of Documents with Complex Layouts, as it was made available to the
participants of the competition.
Data Format
- ground_truth.png
- Description: Ideal image your method should produce. Included in
training/demo dataset only.
- Format: PNG image with 3 channels (RGB, no alpha), “Truecolor” (no indexed
colors), 8 bits per channel, sRGB color space, no embedded ICC profile. Embedded
ICC profiles will be ignored, and values will be assumed to be sRGB-encoded
even in the absence of a specific file header.
- input.mp4
- Description: Video stream which should be processed by your method to produce
an image as close as possible to ground_truth.png.
- Format: No audio stream, 1 video stream: MPEG-4 container, H.264 encoding,
yuv420p color format, variable frame rate. Frame size may differ from one
video to another, but we target the native video recording resolution of
smartphones, which is usually full HD (1080p).
- reference_frame_NN_dewarped.png
- Description: Image of the same shape as the ground truth image:
participants should use either the shape of this image or the shape provided in
task_data.json to find the exact shape of the image they must generate. Other
shapes will result in a failure to evaluate the result. This dewarped image is
generated by “undoing” (“unwarping”) the perspective transform the ground truth
image has suffered, back-projecting the relevant image area into the target image
shape.
The “NN” value in the name indicates that this frame was the NN-th frame of
the video (0-indexed). It usually means it was the first exploitable frame we
found when generating the task. For most of the videos this will be “00”, but
you should not assume so.
- Format: Same as ground_truth.png
- reference_frame_NN_extracted.png
- Description: The exact same frame from the video input which was
“unwarped” to produce the “dewarped” version.
- Format: Same as ground_truth.png
- reference_frame_NN_extracted_viz.png
- Description: Same as reference_frame_NN_extracted.png, but with an extra
visualization of the outline of the object to track drawn over the image.
- Format: Same as ground_truth.png
- task_data.json
- Description: An easy-to-parse file which contains a summary of important
coordinates and shapes: the image to produce (target_image_shape), the input
video frame (input_video_shape), and the object to track
(object_coord_in_ref_frame), along with the id of the frame used as a reference
(reference_frame_id).
- Format: JSON file similar to the example below.
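The PNG format and file naming rules above can be checked before submitting a result. The following is a minimal sketch using only the Python standard library; the function names (`read_png_header`, `is_valid_result`, `reference_frame_index`) are hypothetical helpers, not part of the competition tooling, and the pattern assumes that “NN” is a plain decimal number.

```python
import re
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data):
    """Return (width, height, bit_depth, color_type) from the first 26 bytes of a PNG."""
    # A PNG starts with an 8-byte signature, then the IHDR chunk:
    # 4-byte length, 4-byte type "IHDR", then width, height, bit depth, color type.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">IIBB", data[16:26])


def is_valid_result(data, expected_size):
    """Check shape, 8 bits per channel, and truecolor RGB without alpha."""
    width, height, bit_depth, color_type = read_png_header(data)
    # Color type 2 is truecolor RGB (no alpha, no palette) in the PNG specification.
    return (width, height) == expected_size and bit_depth == 8 and color_type == 2


def reference_frame_index(filename):
    """Extract NN from a reference_frame_NN_*.png file name (assumed decimal)."""
    match = re.fullmatch(
        r"reference_frame_(\d+)_(dewarped|extracted|extracted_viz)\.png", filename
    )
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    return int(match.group(1))
```

For a file on disk, reading the first 26 bytes is enough, for example `is_valid_result(open(path, "rb").read(26), (3508, 2480))`. This only inspects the header; it does not verify the color space or detect an embedded ICC profile.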
Example of task_data.json file
{
    "input_video_shape": {
        "x_len": 1920,
        "y_len": 1080
    },
    "target_image_shape": {
        "x_len": 3508,
        "y_len": 2480
    },
    "object_coord_in_ref_frame": {
        "top_right": {
            "y": -22.679962158203125,
            "x": 1535.1053466796875
        },
        "bottom_left": {
            "y": 830.49786376953125,
            "x": 568.02178955078125
        },
        "bottom_right": {
            "y": 985.6279296875,
            "x": 1526.2147216796875
        },
        "top_left": {
            "y": 177.77229309082031,
            "x": 546.0078125
        }
    },
    "reference_frame_id": 0
}
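A file like the example above can be read with a few lines of Python. This is a minimal sketch that assumes the key layout shown in the example (`x_len`/`y_len` for shapes, `x`/`y` for points); the helper names `parse_task_data` and `corners_outside_frame` and the returned dictionary layout are illustrative choices, not part of the dataset.

```python
import json

# Copied from the example above; a real task would read task_data.json from disk.
SAMPLE = """
{
  "input_video_shape": {"x_len": 1920, "y_len": 1080},
  "target_image_shape": {"x_len": 3508, "y_len": 2480},
  "object_coord_in_ref_frame": {
    "top_right": {"y": -22.679962158203125, "x": 1535.1053466796875},
    "bottom_left": {"y": 830.49786376953125, "x": 568.02178955078125},
    "bottom_right": {"y": 985.6279296875, "x": 1526.2147216796875},
    "top_left": {"y": 177.77229309082031, "x": 546.0078125}
  },
  "reference_frame_id": 0
}
"""

# Fixed corner order (clockwise from top left), chosen here for convenience.
CORNER_ORDER = ("top_left", "top_right", "bottom_right", "bottom_left")


def parse_task_data(text):
    """Parse task_data.json text into (width, height) tuples and a corner list."""
    data = json.loads(text)
    video = data["input_video_shape"]
    target = data["target_image_shape"]
    coords = data["object_coord_in_ref_frame"]
    return {
        "video_size": (video["x_len"], video["y_len"]),
        "target_size": (target["x_len"], target["y_len"]),
        "corners": [(float(coords[k]["x"]), float(coords[k]["y"])) for k in CORNER_ORDER],
        "reference_frame_id": int(data["reference_frame_id"]),
    }


def corners_outside_frame(task):
    """Return the names of corners that lie outside the video frame."""
    width, height = task["video_size"]
    return [
        name
        for name, (x, y) in zip(CORNER_ORDER, task["corners"])
        if not (0 <= x < width and 0 <= y < height)
    ]


task = parse_task_data(SAMPLE)
print(task["target_size"])          # (3508, 2480)
print(corners_outside_frame(task))  # ['top_right']
```

In this example the top right corner has a negative y coordinate, so code that consumes the coordinates should not assume they fall inside the frame.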
Notes:
- Point coordinates are floats, given as x and y values in pixels (the “x” and
“y” keys in the example above). The decimal separator is the dot (“.”) and there
may be no decimal part.
- The coordinates are expressed in a coordinate system whose origin is at the top
left of the image, where the x axis is horizontal (positive toward the right) and
the y axis is vertical (positive toward the bottom); see illustration below.
- Coordinates may fall outside the frame area when a small part of the document
is out of frame (as with the negative y value of top_right in the example above).
- Target shape is given as an integer width and height in pixels (the “x_len” and
“y_len” keys in the example above).
- Frames are 0-indexed (first frame of the video has id 0).
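To illustrate how the object coordinates relate to the target image shape, the sketch below computes a standard four-point perspective transform (homography) in pure Python. It is not the organizers' procedure, which is not documented here. It assumes that the four corners of the object map to the centers of the corner pixels of the target image, and all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def homography(src, dst):
    """Return the 3x3 matrix mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(h, point):
    """Map an (x, y) point through the homography h."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (
        (h[0][0] * x + h[0][1] * y + h[0][2]) / w,
        (h[1][0] * x + h[1][1] * y + h[1][2]) / w,
    )


# Values copied from the example task_data.json above.
FRAME_CORNERS = [
    (546.0078125, 177.77229309082031),          # top_left
    (1535.1053466796875, -22.679962158203125),  # top_right
    (1526.2147216796875, 985.6279296875),       # bottom_right
    (568.02178955078125, 830.49786376953125),   # bottom_left
]
WIDTH, HEIGHT = 3508, 2480
# Assumption: object corners map to the centers of the target's corner pixels.
TARGET_CORNERS = [(0, 0), (WIDTH - 1, 0), (WIDTH - 1, HEIGHT - 1), (0, HEIGHT - 1)]

# Target-to-frame direction: for each output pixel, where to sample the frame.
target_to_frame = homography(TARGET_CORNERS, FRAME_CORNERS)
```

Computing the transform in the target-to-frame direction lets a dewarping loop visit every pixel of the target image once and sample the reference frame at `apply_homography(target_to_frame, (x, y))`. Sample positions that fall outside the frame need explicit handling, since the object may be partly out of view.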
