Calamari-OCR / calamari

Line based ATR Engine based on OCRopy

text regularizer: missing replacements, new group combinations

bertsky opened this issue · comments

While trying to compare GT4HistOCR model performance between Tesseract and Calamari, I stumbled over a few peculiarities of Calamari's (superb!) text pre/postprocessor.

First off, with vanilla Levenshtein (unweighted, without character equivalences or normalization beyond NFC), CER is 0.8% and thus already pretty low on the Calamari model trained by the Qurator team, who seem to have used the default regularizer group, extended.

But on closer inspection, a lot of the remaining errors can be attributed to the quotes and various subgroups of the extended regularizer, amounting to about half of the CER (0.4%).
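For reference, the kind of measurement meant here can be sketched as follows. This is only a minimal illustration, not the evaluation code actually used: the function names are made up for this sketch, and it assumes that line lengths are counted in code points after NFC normalization.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: insert, delete and substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(gt_lines, ocr_lines) -> float:
    """Character error rate: summed edit distance over summed GT length."""
    errors = 0
    chars = 0
    for gt, ocr in zip(gt_lines, ocr_lines):
        # NFC is the only normalization applied; no character equivalences.
        gt = unicodedata.normalize("NFC", gt)
        ocr = unicodedata.normalize("NFC", ocr)
        errors += levenshtein(gt, ocr)
        chars += len(gt)
    return errors / chars if chars else 0.0
```

With such a measure, a model that outputs a regularized quote or dash where the GT keeps the original character is charged one error per character, which is how regularizer groups can end up dominating a low CER.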

Now, GT4HistOCR more or less corresponds to the DTA / OCR-D GT transcription level 2 guidelines. To me that seems like a plausible compromise for OCR models: level 2 transcriptions are not as expensive to produce as level 3 (while still preserving many semantically important distinctions) and can still be reduced to level 1 automatically afterwards (if strong normalization is needed, e.g. for search indexing).

The Calamari equivalent of GT level 2 would be ['spaces', 'roman_digits', 'ligatures-consonantal'] IINM. I therefore suggest giving that combination a predefined group under a suitable name, say conservative (and probably even making that the new default). Also, it would not hurt to map all official GT levels to aliases in Calamari:

  • gtlevel1: ['spaces', 'punctuation', 'quotes', 'various', 'roman_digits', 'ligatures-vocal', 'ligatures-consonantal', 'uvius']
  • gtlevel2: ['spaces', 'roman_digits', 'ligatures-consonantal'] (or 'conservative')
  • gtlevel3: ['spaces']
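To make the alias idea concrete, here is a rough sketch of how such a mapping could be expanded. The table and the `resolve_groups` helper are hypothetical and not part of Calamari; only the group names are taken from the list above.

```python
# Hypothetical alias table. The alias names (gtlevel1..3, conservative) are
# the ones proposed above, not existing Calamari identifiers.
GT_LEVEL_GROUPS = {
    "gtlevel1": [
        "spaces", "punctuation", "quotes", "various", "roman_digits",
        "ligatures-vocal", "ligatures-consonantal", "uvius",
    ],
    "gtlevel2": ["spaces", "roman_digits", "ligatures-consonantal"],
    "gtlevel3": ["spaces"],
}
GT_LEVEL_GROUPS["conservative"] = GT_LEVEL_GROUPS["gtlevel2"]


def resolve_groups(names):
    """Expand aliases into a flat list of group names without duplicates.

    Names that are not aliases are passed through unchanged, so aliases
    and plain group names can be mixed.
    """
    resolved = []
    for name in names:
        for group in GT_LEVEL_GROUPS.get(name, [name]):
            if group not in resolved:
                resolved.append(group)
    return resolved
```

Under this sketch, `resolve_groups(["conservative"])` yields the same three groups as `gtlevel2`, and mixing an alias with an explicit group keeps the order of first appearance.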

Furthermore, I believe these regularizers should be made available prominently on the CLIs:

  • by describing and advertising these options in calamari-train --data.post_proc.processors.5.replacement_groups
  • by adding and describing these options to calamari-predict and calamari-eval (for additional postprocessing beyond what's already in the model)
  • by creating a separate CLI merely for text post-processing (to re-use elsewhere), say calamari-textproc

And second, within the existing groups, I believe a few individual rule changes are worth considering:

  1. The quotes group does not contain the following characters yet: ‚ ‛ ‟ « » ‹ › 〟 〞‟ (low-9, high-reversed-9, angular and historical variants)
  2. In the quotes group, IMHO " (ASCII dq) to '' (double ASCII sq) normalization is not adequate under most circumstances and thus should be an extra.
  3. It's probably useful to also have rules for regularizing footnote numerals ⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ꝰ to ASCII.
  4. In the various group,
    • add 𝛑 → π (U+1D6D1 MATHEMATICAL BOLD SMALL PI)
    • add 𝜋 → π (U+1D70B MATHEMATICAL ITALIC SMALL PI)
    • add 𝝅 → π (U+1D745 MATHEMATICAL BOLD ITALIC SMALL PI)
    • add 𝝿 → π (U+1D77F MATHEMATICAL SANS-SERIF BOLD SMALL PI)
    • add 𝞹 → π (U+1D7B9 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PI)
    • add − → - (U+2212 MINUS SIGN)
    • add ‐ → - (U+2010 HYPHEN)
    • add ‑ → - (U+2011 NON-BREAKING HYPHEN)
    • add ‒ → - (U+2012 FIGURE DASH)
    • add ― → - (U+2015 QUOTATION DASH)
    • add ⁃ → - (U+2043 HYPHEN BULLET)
    • add ﹘ → - (U+FE58 SMALL EM DASH)
    • add ─ → - (U+2500 FORMS LIGHT HORIZONTAL)
    • add ∼ → ~ (U+223C TILDE OPERATOR)
    • add ˜ → ~ (U+02DC SMALL TILDE)
    • add ⁓ → ~ or - (U+2053 SWUNG DASH)
    • add ⟨ → ( (U+27E8 MATHEMATICAL LEFT ANGLE BRACKET)
    • add ⟩ → ) (U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET)
    • add ⁽ → ( (U+207D SUPERSCRIPT LEFT PARENTHESIS)
    • add ⁾ → ) (U+207E SUPERSCRIPT RIGHT PARENTHESIS)
    • add ⁄ → / (U+2044 FRACTION SLASH) – but perhaps we instead should encourage this representation for fractions (even where precomposed codepoints exist)
    • add ∕ → / (U+2215 DIVISION SLASH)
    • add ∖ → \ (U+2216 SET MINUS)
  5. In the uvius group,
    • add ĳ → ij (U+0133 LATIN SMALL LIGATURE IJ)
    • add ⧸ → / (U+29F8 BIG SOLIDUS)
    • add ⧹ → \ (U+29F9 BIG REVERSE SOLIDUS)
    • add ⧵ → \ (U+29F5 REVERSE SOLIDUS OPERATOR)
    • use J → I instead of I → J – because i is more common than j, is more conventional among canonicalizations (e.g. ietzt ietzo), and avoids additional misrepresentation of roman numerals
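To illustrate how a handful of the rules proposed above would behave, here is a small sketch using plain `str.translate`. It is not how Calamari stores or applies its rules: the dictionary layout, the `regularize` helper and the group name `footnotes` are assumptions made for this example, and only a subset of the proposed characters is included (ꝰ is left out because its target is not spelled out above).

```python
# Superscript digits 0-9, written as escapes to avoid font/encoding issues.
SUPERSCRIPT_DIGITS = "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079"

# Illustrative subset of the proposed replacements, keyed by group name.
PROPOSED_RULES = {
    "various": {
        "\U0001D6D1": "\u03C0",  # MATHEMATICAL BOLD SMALL PI -> GREEK SMALL PI
        "\U0001D70B": "\u03C0",  # MATHEMATICAL ITALIC SMALL PI -> GREEK SMALL PI
        "\u2212": "-",           # MINUS SIGN
        "\u2010": "-",           # HYPHEN
        "\u2011": "-",           # NON-BREAKING HYPHEN
        "\u223C": "~",           # TILDE OPERATOR
        "\u27E8": "(",           # MATHEMATICAL LEFT ANGLE BRACKET
        "\u27E9": ")",           # MATHEMATICAL RIGHT ANGLE BRACKET
        "\u2215": "/",           # DIVISION SLASH
    },
    "uvius": {
        "\u0133": "ij",          # LATIN SMALL LIGATURE IJ -> two letters
    },
    # Hypothetical group for footnote numerals (item 3 above).
    "footnotes": {sup: str(d) for d, sup in enumerate(SUPERSCRIPT_DIGITS)},
}


def regularize(text, groups):
    """Apply the single-character rules of the selected groups to text."""
    table = {}
    for group in groups:
        for src, dst in PROPOSED_RULES[group].items():
            table[ord(src)] = dst  # str.translate expects code point keys
    return text.translate(table)
```

Because every source here is a single code point, one translation table is enough and rule order does not matter; multi-character sources (such as a double-quote to two single-quotes rule) would need ordered string replacement instead.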

EDITED for correct CER measurement and its interpretation.

i fully agree with defining and adding further predefined groups like (ocr-d/dta level 1-3). however, in my opinion the default simply has to be "spaces". i think christoph already changed this a few days ago.

is level 2 really supposed to preserve all random PUA codes? i know it is not explicitly stated in the rules but this looks like an odd decision. it does not seem to matter now because uwe wiped them all out when creating the GT4HistOCR corpus but when adding more GT there will be further cases... the punctuation rules would have to be added as well, right?

i also fully agree on the CLI things. in addition, i think it would be helpful to have the rules and profiles available in an external data format, like for example in the PAGETools. @ChWick ?

regarding the proposed individual rule changes i will have to take a closer look but it all looks reasonable to me.

i fully agree with defining and adding further predefined groups like (ocr-d/dta level 1-3). however, in my opinion the default simply has to be "spaces". i think christoph already changed this a few days ago.

Yes, it certainly looks so:

is level 2 really supposed to preserve all random PUA codes? i know it is not explicitly stated in the rules but this looks like an odd decision.

No, IIUC level 2 should replace them all with some regularized form. @tboenig is currently working on a new, more readable version of the guidelines, which should make this clearer. AFAICT we are only beginning to collect these cases/rules.

it does not seem to matter now because uwe wiped them all out when creating the GT4HistOCR corpus but when adding more GT there will be further cases... the punctuation rules would have to be added as well, right?

Yes. And punctuation in particular will be a likely candidate for debate (and refinement, esp. virgula and old punctuation characters).

i also fully agree on the CLI things. in addition, i think it would be helpful to have the rules and profiles available in an external data format, like for example in the PAGETools. @ChWick ?

Oh, @ChWick has already implemented this in a3668d9 – with pkg resources for rules. We could easily replace that with an external Python package (either from Calamari-OCR or managed by OCR-D or any of the other existing regularization libraries) in the future. Great work!

Am I right in assuming I will then somehow be able to reference these JSON file names in calamari-train --data.post_proc.processors.5.replacement_groups?

regarding the proposed individual rule changes i will have to take a closer look but it all looks reasonable to me.

We should coordinate with @tboenig, @stweil, @mikegerber et alii.

@bertsky

Am I right in assuming I will then somehow be able to reference these JSON file names in calamari-train --data.post_proc.processors.5.replacement_groups?

Yes, you can modify the groups. However, the new names are rulesets and rulegroups: rulegroups must be present in the resources of Calamari, whereas rulesets can be a list of predefined rule sets of Calamari (in the resources) or point to an arbitrary JSON path.
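As a rough illustration of that resolution logic (a predefined name or else a JSON path), consider the sketch below. None of this is Calamari's actual code or file format: `BUILTIN_RULESETS`, `load_ruleset`, `load_rulesets` and the assumed JSON layout (a list of [source, target] pairs) are all hypothetical, and the built-in rule contents are made up.

```python
import json
from pathlib import Path

# Stand-in for rule sets shipped as package resources (contents made up).
BUILTIN_RULESETS = {
    "spaces": [["  ", " "]],
    "quotes": [["\u201E", "\""], ["\u201C", "\""]],
}


def load_ruleset(name_or_path):
    """Return [source, target] pairs for a built-in name or a JSON file."""
    if name_or_path in BUILTIN_RULESETS:
        return BUILTIN_RULESETS[name_or_path]
    path = Path(name_or_path)
    if path.is_file():
        # Assumed layout: a JSON list of [source, target] pairs.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    raise ValueError(f"unknown rule set: {name_or_path!r}")


def load_rulesets(names_or_paths):
    """Concatenate the rules of several rule sets in the given order."""
    rules = []
    for entry in names_or_paths:
        rules.extend(load_ruleset(entry))
    return rules
```

Checking built-in names before the file system means a local file cannot silently shadow a predefined rule set; whether Calamari resolves names in that order is not something this sketch claims.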