OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page:https://ocr-d.de/core/

PageValidator: more options / actions

bertsky opened this issue

IMO we should strive to support many more validation and repair features in ocrd_validators.page_validator – esp. functionality known from PRImA Converter and Validator (PCV) and HTR United VX (HTRVX).

From PCV:

  -val-rules <Rule1[,Rule2,...]>: Defines what to validate (optional)
                                  Note: If no rules are defined, everything
                                        is validated.
             Available rules (use comma only; no spaces):
               General Checks:
                 VALIDATE_GTSID_DEFINED
                 VALIDATE_LAYERS
               Reading Order:
                 VALIDATE_READING_ORDER_DEFINED
                 VALIDATE_READING_ORDER_COMPLETE
                 VALIDATE_TYPE_OF_REGIONS_IN_READING_ORDER
               Region Related Checks:
                 VALIDATE_REGIONS_WITHIN_DOCUMENT_BOUNDARIES
                 VALIDATE_REGIONS_WITHIN_BORDER
                 VALIDATE_REGIONS_DONT_OVERLAP
                 CALCULATE_REGION_OVERLAP_AREA
                 VALIDATE_REGION_WITHIN_PARENT_REGION
                 VALIDATE_NO_INTERSECTING_POLYGON_LINES
                 VALIDATE_GHOST_REGIONS
                 VALIDATE_PRINTSPACE
                 VALIDATE_PENDING_REGIONS
                 VALIDATE_COMPONENTS_INSIDE_REGIONS (requires image)
                 VALIDATE_MISSING_ELEMENTS
                 VALIDATE_NESTED_REGIONS
               Text Related Checks:
                 VALIDATE_TEXT_DEFINED
                 VALIDATE_UNICODE_TEXT_DEFINED
                 VALIDATE_DEPRECATED_CHARACTERS
                 VALIDATE_REPLACEMENT_CHARACTER
                 VALIDATE_PENDING_CHARACTER
                 VALIDATE_TEXT_CONTENT
               Other:
                 STRUCTURAL_INTEGRITY
  -val-params <INI file>: Load additional validation parameters (optional)
  -remove <Filter1[,Filter2,...]>: Remove layout objects (optional).
          Available filters (use comma only; no spaces):
            REGIONS,NESTED_REGIONS,TEXT_LINES,WORDS,GLYPHS,READING_ORDER,LAYERS
  -remove-ghosts <Filter1[,Filter2,...]>: Remove ghost objects (optional)
                                          Ghosts are regions, text lines,
                                          words or glyphs without outline.
                 Available filters (use comma only; no spaces):
                   REGIONS,TEXT_LINES,WORDS,GLYPHS,ALL
  -convert-text <XML file with rules>: Text content conversion (optional)
  -apply-offset <offsetX,offsetY>: Move all layout objects by specified offset
                                   (optional)
                Example: -10,20  (no spaces!)
  -scale <scaleX,scaleY>: Scale all layout objects by specified factor
                          Use 'auto' for scaleX and/or scaleY to scale using
                          the difference between image and XML dimensions.
                          (optional) (done after apply-offset)
                Example: 0.5,0.5  (no spaces!)
  -rotate <degrees>: Rotates all polygon points of all layout objects clockwise
                     around the centre of the page.
  -refine-outlines: Refine region outlines. Applies to conversion from non-PAGE
                    formats (e.g. ALTO) if supported.

General Checks:

VALIDATE_GTSID_DEFINED

trivial

repair: uuid, or rather based on file name?
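
A minimal sketch of both repair options. It assumes the ID has to be a valid XML name (no leading digit, no spaces); the function names are hypothetical and not part of ocrd_validators.

```python
import re
import uuid
from pathlib import Path


def gtsid_from_filename(path: str) -> str:
    """Derive a stable ID from the file name (assumption: ID must be an XML name)."""
    stem = Path(path).stem
    # Replace everything outside a conservative character set.
    ident = re.sub(r"[^A-Za-z0-9_.-]", "_", stem)
    # XML names must not start with a digit, '.' or '-'.
    if not re.match(r"[A-Za-z_]", ident):
        ident = "id_" + ident
    return ident


def gtsid_random() -> str:
    """Fallback: a random ID, unique but not reproducible across runs."""
    return "id_" + uuid.uuid4().hex
```

The file-name variant has the advantage of being reproducible, so re-running the repair would not change the document again.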

VALIDATE_LAYERS

Checks if all regions are assigned to
layers (only if there exists at least
one layer).

Not entirely sure we need this, and what it is used for.

Reading Order:

VALIDATE_READING_ORDER_DEFINED

trivial

repair: see page-ensure-readingorder

VALIDATE_READING_ORDER_COMPLETE

Some text regions are missing in the
reading order.

trivial

but not so easy to repair (would require merging existing RO with generated entries)
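
The naive version of that merge can be sketched on plain ID lists. This assumes a flat reading order and simply appends the missing regions in document order; a real repair would have to respect nested groups and probably insert by position. The function name is hypothetical.

```python
def complete_reading_order(existing, regions):
    """Return the existing order followed by all region IDs it is missing.

    existing: region IDs in the current reading order
    regions:  all text region IDs in document order
    """
    seen = set(existing)
    missing = [r for r in regions if r not in seen]
    return list(existing) + missing
```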

VALIDATE_TYPE_OF_REGIONS_IN_READING_ORDER

There are one or more regions within
the reading order that shouldn’t be
there (only paragraphs, headings,
drop-capitals, catch-words and TOC-
entries are supposed to be in the
reading order).

trivial

repair: maybe just a filter?
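
Such a filter could look roughly like this, assuming we already have the ordered region IDs and a mapping from ID to region type. The set of allowed types is taken from the PCV description above; the names are hypothetical.

```python
# Types PCV expects in the reading order, per the description above.
ALLOWED_TYPES = frozenset(
    {"paragraph", "heading", "drop-capital", "catch-word", "TOC-entry"}
)


def filter_reading_order(order, region_types):
    """Split ordered region IDs into (kept, rejected) by region type.

    order:        region IDs in reading order
    region_types: dict mapping region ID to its type string
    """
    kept = [r for r in order if region_types.get(r) in ALLOWED_TYPES]
    rejected = [r for r in order if region_types.get(r) not in ALLOWED_TYPES]
    return kept, rejected
```

The validator would report `rejected`; the repair would write back `kept`.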

Region Related Checks:

VALIDATE_REGIONS_WITHIN_DOCUMENT_BOUNDARIES

already covered by check_coords

repair: see ocrd-segment-repair

VALIDATE_REGIONS_WITHIN_BORDER

already covered by check_coords

repair: see ocrd-segment-repair

VALIDATE_REGIONS_DONT_OVERLAP

trivial

repair: not trivial, but see ocrd-segment-repair with plausibilize=true and ocrd-cis-ocropy-clip

CALCULATE_REGION_OVERLAP_AREA

trivial

VALIDATE_REGION_WITHIN_PARENT_REGION

already covered by check_coords

repair: see ocrd-segment-repair

VALIDATE_NO_INTERSECTING_POLYGON_LINES

One or more polygons have
intersecting lines and therefore
contain loops.

i.e. self-intersection

already covered by check_coords

repair: see ocrd-segment-repair
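
For illustration only, a dependency-free sketch of what a self-intersection test does. It is not how check_coords is implemented, and it only detects proper crossings of non-adjacent edges (touching or collinear overlaps are ignored).

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1 or 0."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _segments_cross(p1, p2, p3, p4):
    """True if segments p1-p2 and p3-p4 cross in a single interior point."""
    return (_orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
            and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0)


def is_self_intersecting(points):
    """Check all pairs of non-adjacent edges of a closed polygon (O(n^2))."""
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share a vertex
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False
```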

VALIDATE_GHOST_REGIONS

Looks for ghost regions (regions
without outline)

not sure what this exactly means; zero area? negligible area? no coords at all (would already be syntactically invalid)?
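
If "ghost" turns out to mean "fewer than three points or (nearly) zero area", the check would be small. This is just one possible reading, and the `min_area` threshold is a made-up parameter.

```python
def parse_points(points):
    """Parse a PAGE points string ("x1,y1 x2,y2 ...") into integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def polygon_area(coords):
    """Absolute polygon area via the shoelace formula."""
    n = len(coords)
    twice = sum(
        coords[i][0] * coords[(i + 1) % n][1]
        - coords[(i + 1) % n][0] * coords[i][1]
        for i in range(n)
    )
    return abs(twice) / 2


def is_ghost(points, min_area=1.0):
    """Assumed definition: no usable outline or an area below min_area."""
    coords = parse_points(points or "")
    return len(coords) < 3 or polygon_area(coords) < min_area
```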

VALIDATE_PRINTSPACE

Checks if regions of type page-
number, signature-mark, marginalia
or catch-word are NOT within the
print space

trivial

repair: shrink PrintSpace?

VALIDATE_PENDING_REGIONS

There are text regions without
parent (e.g. word without text line)

How is that even possible syntactically?

VALIDATE_COMPONENTS_INSIDE_REGIONS (requires image)

One or more regions contain
connected components that are
partly outside the region.

not difficult

repair: see ocrd-cis-ocropy-clip, ocrd-cis-ocropy-resegment (for textlines) and functions postprocess/morphmasks in ocrd-detectron2

VALIDATE_MISSING_ELEMENTS

Some text elements have no child
elements

again, unclear what that entails

Text Related Checks:

VALIDATE_TEXT_DEFINED

Some text regions have no text
ground-truth.

trivial

VALIDATE_UNICODE_TEXT_DEFINED

Some text regions have plain text
defined but not Unicode text.

trivial

VALIDATE_DEPRECATED_CHARACTERS

Deprecated characters are characters that were linked to a private Unicode code point but now have
a dedicated slot in the normal Unicode sections. The filter corrects such changes.

doable
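
The detection half is easy with the standard library. The sketch below only finds private-use characters; mapping them to their dedicated code points would need a lookup table, which is not shown here. The function name is hypothetical.

```python
import unicodedata


def private_use_characters(text):
    """Return (index, code point) for every private-use character in text."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if unicodedata.category(c) == "Co"  # "Co" = Other, private use
    ]
```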

VALIDATE_REPLACEMENT_CHARACTER

Checks for occurrences of the
replacement character (U+FFFD)
in text elements

trivial

VALIDATE_PENDING_CHARACTER

Checks for occurrences of pending
characters (U+F51C) in text
elements

trivial
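
Both character checks boil down to the same scan. A sketch, with the two code points taken from the PCV descriptions above and hypothetical names:

```python
SUSPICIOUS = {
    "\uFFFD": "replacement character",
    "\uF51C": "pending character",
}


def find_suspicious_characters(text):
    """Return (index, label) for each replacement or pending character."""
    return [(i, SUSPICIOUS[c]) for i, c in enumerate(text) if c in SUSPICIOUS]
```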

VALIDATE_TEXT_CONTENT

Checks the content of text elements
for inconsistencies (e.g. spaces in
words, trailing line breaks, non
matching text of child and parent
text objects)

already covered by page_textequiv_consistency and page_textequiv_strategy

repair: see page-textequiv-lines-to-regions, page-textequiv-words-to-lines, and function page_update_higher_textequiv_levels in all our *-recognize processors

Other:

STRUCTURAL_INTEGRITY

For XML validation errors, XML
reader warnings (e.g. old PAGE
format) and wrong image
dimensions.

would need to look at the exact implementation, but we should indeed pass any errors from the schema-backed (generateds) parser and present them as (actionable) exceptions

-remove <Filter1[,Filter2,...]>: Remove layout objects (optional).

Available filters (use comma only; no spaces):
REGIONS,NESTED_REGIONS,TEXT_LINES,WORDS,GLYPHS,READING_ORDER,LAYERS

trivial

-remove-ghosts <Filter1[,Filter2,...]>: Remove ghost objects (optional)

Available filters (use comma only; no spaces):
REGIONS,TEXT_LINES,WORDS,GLYPHS,ALL
Ghosts are regions, text lines,
words or glyphs without outline.

see above

-convert-text <XML file with rules>: Text content conversion (optional)

To apply a filter use -convert-text in the command line call and provide the file name of the XML file
containing the filter rules as additional argument. The XML file must have the following format:

<?xml version="1.0" encoding="utf-8"?>
<Parameters>
  <Parameter type="4" name="Replacement Rule"
                    id="1"
                    sortIndex="1"
                    value="0065:=0061"
                    visible="false"
                    isSet="true">
    <Description>Replace e by a</Description>
  </Parameter>
  <Parameter type="4" name="Replacement Rule"
                    id="2"
                    sortIndex="2"
                    value="0074:="
                    visible="false"
                    isSet="true">
    <Description>Deletes all t</Description>
  </Parameter>
</Parameters>

Each parameter element contains a replacement rule. The sortIndex attribute specifies in which
order the rules will be applied. The id attribute must be unique (easiest to use the same value as the
sort index). The description is optional but helps to understand the rules. The actual rule is encoded
in the value attribute. The general format is “HHHH[,HHHH,...]:=[HHHH,HHHH,...]”. HHHH is a
Unicode character represented as 4 digit hexadecimal number. In the example above “0065:=0061”
means ‘replace all characters e with character a’. To replace a character sequence separate the
single characters by comma. The same applies for the right-hand side (the replacement character or
sequence). It is also possible to remove characters by leaving the right-hand side empty (e.g.
“0074:=” to delete all ts).

Sounds like a lot of effort for little gain. People could write their own XSLTs and text processors. But perhaps there are already tons of existing patterns, so supporting this mechanism does have merit?
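
For a sense of the effort involved, here is a rough sketch of reading and applying such rule files with the standard library only. It follows the format as quoted above (rules sorted by sortIndex, "HHHH[,HHHH,...]:=[HHHH,...]" in the value attribute); all function names are hypothetical, and whether PCV handles edge cases the same way is an assumption.

```python
import xml.etree.ElementTree as ET


def _decode(side):
    """Turn "HHHH,HHHH" into the corresponding string ("" stays empty)."""
    return "".join(chr(int(h, 16)) for h in side.split(",") if h)


def parse_rule(value):
    """Parse one rule into a (search, replacement) pair of strings."""
    left, sep, right = value.partition(":=")
    if not sep or not left:
        raise ValueError(f"malformed rule: {value!r}")
    return _decode(left), _decode(right)


def load_rules(xml_text):
    """Read all Parameter elements and return their rules in sortIndex order."""
    root = ET.fromstring(xml_text)
    params = sorted(root.iter("Parameter"),
                    key=lambda p: int(p.get("sortIndex", "0")))
    return [parse_rule(p.get("value", "")) for p in params]


def apply_rules(text, rules):
    """Apply the rules one after another, each as a plain string replacement."""
    for search, replacement in rules:
        text = text.replace(search, replacement)
    return text
```

With the two rules from the example file, `apply_rules("test letter", rules)` gives `"as laar"` (first e becomes a, then every t is dropped).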

-apply-offset <offsetX,offsetY>: Move all layout objects by specified offset
(optional)
Example: -10,20 (no spaces!)

trivial
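
On the level of a single points attribute this amounts to the following sketch (hypothetical function name; no clipping, so negative coordinates can result and would have to be caught by the coordinate checks afterwards):

```python
def apply_offset(points, dx, dy):
    """Shift every "x,y" pair of a PAGE points string by (dx, dy)."""
    shifted = []
    for pair in points.split():
        x, y = (int(v) for v in pair.split(","))
        shifted.append(f"{x + dx},{y + dy}")
    return " ".join(shifted)
```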

-refine-outlines: Refine region outlines. Applies to conversion from
non-PAGE formats (e.g. ALTO) if supported.

not sure what that entails

From HTRVX:

-s, --segmonto 	False 	Apply Segmonto Zoning verification
--zone TEXT 	None 	Provide a custom zone to control zone types instead of Segmonto
--line TEXT 	None 	Provide a custom line to control Line types instead of Segmonto

tbh, I don't understand that part

-e, --check-empty 	False 	Check for empty lines or empty zones
-r, --raise-empty 	False 	Warns but not fails if empty lines or empty zones are found

see above

-x, --xsd 	False 	Apply XSD Schema verification

see above

-i, --check-image 	False 	Check if the image link in the XML points to the right path

already covered