Calamari-OCR / calamari

Line based ATR Engine based on OCRopy

Allow namespace prefixes other than 'None' in PageXML

alexander-winkler opened this issue · comments

Eynollah, e.g., produces PageXML files that use an explicit prefix (xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15").

calamari_ocr/ocr/dataset/datareader/pagexml/reader.py, however, expects the prefix to be 'None' and throws an error when processing an eynollah pagexml.

When I change line 120 of reader.py from

ns = {"ns": root.nsmap[None]} 

to

ns = {'ns' : 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

it works. I'm not sure if you can generalize the namespace dictionary to cover both output styles. Maybe xpath's local-name function (instead of lxml find or findall) is an alternative.

Maybe not the most elegant solution but I guess wildcarding namespaces might solve this.
So e.g. page = root.find(".//{*}Page") instead of page = root.find(".//ns:Page", namespaces=ns)
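A minimal sketch of the wildcard idea, assuming Python 3.8+ and using the standard library's xml.etree.ElementTree rather than lxml so it runs without extra dependencies (the `{*}` syntax is, as far as I know, also accepted by lxml's find/findall). The two sample documents and the helper name find_page are made up for illustration; they are not taken from reader.py.

```python
import xml.etree.ElementTree as ET

NS_2019 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# The same tiny document twice: once with a default namespace,
# once with an explicit "pc" prefix as Eynollah writes it.
DEFAULT_NS_XML = f'<PcGts xmlns="{NS_2019}"><Page imageFilename="a.jpg"/></PcGts>'
PREFIXED_XML = f'<pc:PcGts xmlns:pc="{NS_2019}"><pc:Page imageFilename="a.jpg"/></pc:PcGts>'


def find_page(xml_text):
    """Return the first Page element, whatever namespace it lives in."""
    root = ET.fromstring(xml_text)
    # "{*}Page" matches the local name "Page" in any namespace (Python 3.8+).
    return root.find(".//{*}Page")


for doc in (DEFAULT_NS_XML, PREFIXED_XML):
    page = find_page(doc)
    print(page.tag, page.get("imageFilename"))
```

Both documents resolve to the same element because the prefix is only a serialization detail; the parser expands it to the namespace URI. The trade-off of the wildcard is that it would also match a Page element from an unrelated namespace.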

Suggestion: in

and :
ns = {"ns": root.nsmap[root.prefix]}

Dinglehopper does something similar to stay PAGE-version-agnostic: When parsing the document, it determines the namespace of the PcGts element and uses that for querying the XML. We cannot do this in OCR-D because we generate our PAGE API from the PAGE 2019 XSD but if we could we would since e.g. Transkribus uses an extended PAGE 2013 namespace and there is data with other PAGE namespaces around.
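The approach described above (read the namespace off the root element, then query with it) could be sketched roughly as follows. This is not Dinglehopper's or Calamari's actual code; the helper names namespace_of and find_page are hypothetical, and the standard library's ElementTree is used so the example is self-contained. With lxml, etree.QName(root).namespace should give the same URI.

```python
import xml.etree.ElementTree as ET

PAGE_2013 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
PAGE_2019 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def namespace_of(element):
    """Return the namespace URI from a tag of the form '{uri}LocalName'."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    raise ValueError("root element is not in any namespace")


def find_page(xml_text):
    """Find the Page element using whatever namespace the root element uses."""
    root = ET.fromstring(xml_text)
    ns = {"ns": namespace_of(root)}  # independent of prefix and PAGE version
    return root.find(".//ns:Page", namespaces=ns)


# Illustrative inputs: an unprefixed 2013 file and a prefixed 2019 file.
OLD_STYLE = f'<PcGts xmlns="{PAGE_2013}"><Page/></PcGts>'
PREFIXED = f'<pc:PcGts xmlns:pc="{PAGE_2019}"><pc:Page/></pc:PcGts>'

for doc in (OLD_STYLE, PREFIXED):
    print(find_page(doc).tag)
```

Compared with the wildcard, this keeps the existing "ns:" queries unchanged and only swaps how the namespace dictionary is built.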

Hi @alexander-winkler, could you please test if #259 works with Eynollah output? I don't have a file for testing at hand. Would also be interesting if lxml behaves correctly when writing predictions to PAGE XML with explicit prefixes.

When I try to run calamari-predict on an Eynollah xml I still get an error. I'm unable, however, to figure out if it's related to the issue here or to something else.

My call:

calamari-predict \
    --data PageXML \
    --data.text_index 1 \
    --data.images image.jpg \
    --data.xml_files image.xml \
    --checkpoint some_model.ckpt.json

The error I get:

CRITICAL 2021-06-12 15:30:53,892             tfaip.util.logging: Uncaught exception
Traceback (most recent call last):
  File "/home/user/virtualenvs/dev_calamari/bin/calamari-predict", line 33, in <module>
    sys.exit(load_entry_point('calamari-ocr==2.1.2', 'console_scripts', 'calamari-predict')())
  File "/home/user/virtualenvs/dev_calamari/lib/python3.8/site-packages/calamari_ocr-2.1.2-py3.8.egg/calamari_ocr/scripts/predict.py", line 188, in main
    run(args.root)
  File "/home/user/virtualenvs/dev_calamari/lib/python3.8/site-packages/calamari_ocr-2.1.2-py3.8.egg/calamari_ocr/scripts/predict.py", line 126, in run
    raise Exception("Empty dataset provided. Check your files argument (got {})!".format(args.files))
AttributeError: 'PredictArgs' object has no attribute 'files'

Here is a sample file: https://cloud.uni-halle.de/s/MCuYzWD2ABv464z

The error indicates that your file paths are wrong (no files found). The error message itself is buggy, though: it refers to a non-existent "files" argument. Check --data.images and --data.xml_files instead.
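To illustrate why the traceback shows an AttributeError rather than the intended "Empty dataset" message, here is a small sketch. The classes below are hypothetical stand-ins, not Calamari's real PredictArgs, and run_fixed is only one possible wording, not the project's actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class DataArgs:
    # Hypothetical stand-in for the --data.* options.
    images: list = field(default_factory=list)
    xml_files: list = field(default_factory=list)


@dataclass
class PredictArgs:
    # Hypothetical stand-in; note that there is no `files` attribute.
    data: DataArgs = field(default_factory=DataArgs)


def run_buggy(args, n_samples):
    if n_samples == 0:
        # args.files is evaluated while building the message and raises
        # AttributeError, so the intended exception is never constructed.
        raise Exception("Empty dataset provided. Check your files argument (got {})!".format(args.files))


def run_fixed(args, n_samples):
    if n_samples == 0:
        # Only reference attributes that exist on the args object.
        raise ValueError(
            "Empty dataset provided. Check --data.images (got {}) and --data.xml_files (got {})!".format(
                args.data.images, args.data.xml_files
            )
        )


args = PredictArgs(DataArgs(images=["16642027.jpg"], xml_files=["16642027.xml"]))
for run in (run_buggy, run_fixed):
    try:
        run(args, n_samples=0)
    except Exception as exc:
        print(type(exc).__name__, "-", exc)
```

The first call reports the missing attribute and hides the real cause; the second reports the empty dataset along with the paths that were actually passed.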

I'm not sure what the problem is:
My directory:

.
├── 16642027.jpg
└── 16642027.xml

My actual call:

calamari-predict --data PageXML --data.text_index 1 --data.images 16642027.jpg --data.xml_files 16642027.xml --checkpoint model.ckpt.json

According to your call, the image, XML, and model must all be in the same location (no directory prefixes). However, your directory listing shows only two files. If that is accurate, you must provide a relative path to either the XML or the model file.

The model is located in a different directory. I've replaced it in my post above to make things simpler.

Okay, just wanted to "clarify" the most obvious mistake. Luckily, I receive the same error using your files. I will examine this and hopefully find the reason!

Okay, some updates on that: there are no ns:TextLine elements in the XML file, only text regions, which is why Calamari does not find any content to read. I guess the interesting regions are those with type="paragraph". @andbue, is it reasonable to also scan for TextRegions, or only for TextLine, which should be the correct container?
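A quick way to check a PAGE file for this situation before running a prediction is to count elements by local name. This is a rough diagnostic sketch, not part of Calamari; the function name count_page_elements and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def count_page_elements(xml_text, names=("TextRegion", "TextLine")):
    """Count elements by local name, ignoring namespace and prefix."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip anything that is not a regular element
        local = el.tag.rsplit("}", 1)[-1]  # strip a leading "{uri}" if present
        if local in names:
            counts[local] += 1
    return {name: counts[name] for name in names}


# Regions only: a line-based recognizer has nothing to read here.
REGIONS_ONLY = (
    f'<pc:PcGts xmlns:pc="{NS}"><pc:Page>'
    '<pc:TextRegion id="r1" type="paragraph"/>'
    '</pc:Page></pc:PcGts>'
)
# Regions with lines: what a line-based recognizer needs.
WITH_LINES = (
    f'<pc:PcGts xmlns:pc="{NS}"><pc:Page>'
    '<pc:TextRegion id="r1" type="paragraph"><pc:TextLine id="l1"/></pc:TextRegion>'
    '</pc:Page></pc:PcGts>'
)

print(count_page_elements(REGIONS_ONLY))
print(count_page_elements(WITH_LINES))
```

A TextLine count of zero would point to the layout step (here, Eynollah without line segmentation) rather than to the namespace handling.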

I don't think it would be a great idea to run calamari on anything other than text lines. Eynollah should also be able to do line segmentation and output TextLine elements if I'm not totally mistaken.

@andbue Agreed!
Can this be closed? (And the corresponding PR #259 be merged?)

Maybe @alexander-winkler could test the prediction on Eynollah before we close this? Or we could simply reopen if something goes wrong with the prediction.

I have tried it (cf. post above) but ended up with an Eynollah problem. I'll give it another try, but Eynollah layout recognition takes some time, so bear with me.

Ok,

eynollah -i image.jpg -o output_dir -m path/to/models_eynollah -fl -cl

produces image.xml with prefixed TextRegions and TextLines, and the latter are OCR'ed properly with calamari-predict.

Thanks for testing @alexander-winkler . I will close this and merge #259