OCR_Testdata_EarlyPrintedBooks

A selection of test lines from several early printed books, together with the corresponding individual and mixed OCRopus models.

Data

LatinHist-98000.pyrnn.gz is a mixed OCRopus model trained on twelve Latin books printed in Antiqua types between 1471 and 1686, with a focus on early works: ten of the twelve were produced before 1600. For details about the books, please see [1]. Training was performed on 8,684 lines, and the best model was chosen by evaluating all resulting models on 2,432 previously unseen test lines. The lowest character error rate (CER) achieved was 2.92%, after 98,000 training steps.
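The model selection described above can be illustrated with a minimal Python sketch. It assumes that CER is computed as the total Levenshtein distance between ground truth and OCR output divided by the total number of ground-truth characters; the exact formula used for the reported 2.92% is documented in [1], not here. The function names (`edit_distance`, `cer`, `best_checkpoint`) and the dictionary of per-checkpoint outputs are hypothetical and not part of OCRopus.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(gt_lines, ocr_lines):
    """Character error rate over a set of lines (assumed definition:
    summed edit distance divided by summed ground-truth length)."""
    errors = sum(edit_distance(g, o) for g, o in zip(gt_lines, ocr_lines))
    chars = sum(len(g) for g in gt_lines)
    return errors / chars if chars else 0.0


def best_checkpoint(gt_lines, outputs_by_step):
    """Return the training step whose OCR output has the lowest CER.

    outputs_by_step is a hypothetical mapping from training step to the
    list of recognized lines produced by the model saved at that step.
    """
    return min(outputs_by_step,
               key=lambda step: cer(gt_lines, outputs_by_step[step]))
```

Given the recognized test lines for each saved model, `best_checkpoint(gt_lines, outputs_by_step)` returns the step with the lowest CER on the held-out lines, which mirrors how the model at 98,000 steps was picked.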

The Books folder contains the seven early printed books used for evaluation in [2]. For each book we provide 150 lines of GT as well as a strong individual model (coming soon) trained on a large number of lines (at least 1,500). For details about the books, please see [2].

For additional GT for Latin, Greek and German Fraktur please see the CIS OCR Testset.

Licence

All data available in this repository is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

Please cite (one of) the following publications when using the data.

Latin Hist

[1] Springmann et al. (2016). Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings. ArXiv e-prints.

@article{springmann2016automatic,
  title = {Automatic quality evaluation and (semi-) automatic improvement of {OCR} models for historical printings},
  author = {Springmann, Uwe and Fink, Florian and Schulz, Klaus U},
  journal = {ArXiv e-prints},
  url = {https://arxiv.org/abs/1606.05157},
  year = {2016}
}

Seven Mixed Early Printed Books

[2] Reul et al. (2017). Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. ArXiv e-prints. Submitted to the 13th IAPR International Workshop on Document Analysis Systems.

@article{reul2017voting,
  title = {Improving {OCR} Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting},
  author = {Reul, Christian and Springmann, Uwe and Wick, Christoph and Puppe, Frank},
  journal = {ArXiv e-prints},
  url = {https://arxiv.org/abs/1711.09670},
  year = {2017}
}
