Order the ground truth section by type ?

Question

PonteIneptique opened this issue 6 years ago · comments

Hi there!
I found out that a GT section had been added just as I was tempted to create my own awesome list.
One thing I think would be great is categorizing this section a little more (Manuscript / Early print / Modern / Contemporary?). That would probably make these data easier to browse.

Konstantin Baierer · Answer 1 · Tue Jan 08 2019 16:27:50 GMT+0800 (China Standard Time)

The list is @cneud's work and it's maintained at https://github.com/cneud/ocr-gt.

We're working on a project to make open-source OCR readily deployable in libraries, archives, etc. (https://github.com/OCR-D / http://ocr-d.de). Training is an important part of highly accurate OCR, especially for historical texts. Training requires the right ground truth, and for everything to work together, the ground truth itself, the corpora, the tools, etc. need to be described in a structured way.

Therefore, we want to define a JSON schema for describing ground truth and restructure the list/table into a JSON file, cf. cneud/ocr-gt#11. This should be aligned with OCR-D/spec#86, where we want to define schemas for both training data and trained models. Ideally, all inputs and outputs of the individual steps in an OCR workflow would be data defined by such a schema.
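
To make the idea concrete, here is a rough sketch of what a structured ground-truth entry and a grouping by type could look like. Everything in it is an assumption for illustration: the field names (`name`, `url`, `period`, `languages`), the period values (borrowed from the categories suggested in the question), and the sample records are hypothetical and are not taken from cneud/ocr-gt#11 or OCR-D/spec#86. It uses only the Python standard library rather than a real JSON Schema validator.

```python
import json
from collections import defaultdict

# Hypothetical categories, borrowed from the suggestion in the question.
PERIODS = {"manuscript", "early-print", "modern", "contemporary"}

# Hypothetical required fields and their types; the actual schema is undefined here.
REQUIRED_FIELDS = {"name": str, "url": str, "period": str, "languages": list}


def validate_entry(entry):
    """Return a list of problems found in one ground-truth entry (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    period = entry.get("period")
    if isinstance(period, str) and period not in PERIODS:
        problems.append(f"unknown period: {period}")
    return problems


def group_by_period(entries):
    """Map each period to the names of the entries tagged with it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["period"]].append(entry["name"])
    return dict(grouped)


# Placeholder records, not real datasets.
SAMPLE = json.loads("""
[
  {"name": "example-manuscripts", "url": "https://example.org/a",
   "period": "manuscript", "languages": ["la"]},
  {"name": "example-incunabula", "url": "https://example.org/b",
   "period": "early-print", "languages": ["de", "la"]},
  {"name": "example-charters", "url": "https://example.org/c",
   "period": "manuscript", "languages": ["fr"]}
]
""")

if __name__ == "__main__":
    for record in SAMPLE:
        print(record["name"], validate_entry(record))
    print(group_by_period(SAMPLE))
```

With entries stored like this, the categorized view asked for in the question could be generated from the JSON file instead of being maintained by hand.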

@mittagessen has been describing the models for his kraken OCR engine in such a way for a while.

@wrznr @tboenig @cneud