ICDAR2019 Post-OCR Text Correction

Overview

The original corpus consists of OCRed documents in 10 European languages, totalling about 20M characters (3.5M tokens), aligned with their corresponding Gold Standard (ground truth). Each language contains one or more sub-folders, unbalanced in size, one per source dataset.
Dataset details: partitioning

The original Excel form is also available. Each training file contains three blocks. Note that only the first block, [OCR_output], is included in the test set.
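As a rough illustration of how such a file could be read, here is a minimal sketch. It assumes that each block sits on a single line beginning with a bracketed tag; only the [OCR_output] tag is named on this page, so the other two tags in the sample (BLOCK_2, BLOCK_3) and the function name parse_blocks are hypothetical placeholders, not the dataset's actual labels.

```python
import re
from typing import Dict

# Assumption: each block is one line that starts with a bracketed tag,
# e.g. "[OCR_output] some text". Only [OCR_output] is named on the page;
# the other tags used in the sample below are hypothetical placeholders.
BLOCK_RE = re.compile(r"^\[\s*([^\]]+?)\s*\]\s?(.*)$")


def parse_blocks(text: str) -> Dict[str, str]:
    """Map each bracketed tag to the text that follows it on the same line."""
    blocks: Dict[str, str] = {}
    for line in text.splitlines():
        match = BLOCK_RE.match(line)
        if match:
            tag, body = match.groups()
            blocks[tag] = body
    return blocks


sample = (
    "[OCR_output] Tbe quick brovvn fox\n"
    "[BLOCK_2] hypothetical second block\n"
    "[BLOCK_3] hypothetical third block\n"
)
blocks = parse_blocks(sample)

# A test-set file would carry only the first block, so look it up defensively.
ocr_text = blocks.get("OCR_output", "")
```

The pattern drops at most one space after the closing bracket. If the blocks are character-aligned, any further leading whitespace may be significant, so it is left untouched here.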

Citation

@inproceedings{rigaud2019pocr,
 title="ICDAR 2019 Competition on Post-OCR Text Correction",
 author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
 year={2019},
 booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
 }
Basic Information
Application Scenarios: Not Available
Annotations: Not Available
Tasks: Not Available
License: Custom
Updated on: 2021-01-20 04:15:44
Metadata
Data Type: Not Available
Data Volume: 0
Annotation Amount: 0
File Size: 0B
Copyright Owner: ICDAR 2019
Annotator: Unknown