Allow namespace prefixes other than 'None' in PageXML
alexander-winkler opened this issue
Eynollah, e.g., produces PageXML files that use an explicit prefix (xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15").
calamari_ocr/ocr/dataset/datareader/pagexml/reader.py, however, expects the prefix to be 'None' and throws an error when processing an Eynollah PageXML.
When I change line 120 of reader.py
from
ns = {"ns": root.nsmap[None]}
to
ns = {'ns' : 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}
it works. I'm not sure if you can generalize the namespace dictionary to cover both output styles. Maybe XPath's local-name function (instead of lxml find or findall) is an alternative.
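A minimal sketch of the local-name() idea, assuming lxml is installed. The two sample documents and the helper name find_pages are made up for illustration and are not taken from Calamari's reader.py.

```python
from lxml import etree

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical minimal documents: default namespace vs. explicit "pc" prefix.
DEFAULT_NS_XML = f'<PcGts xmlns="{NS}"><Page imageFilename="a.jpg"/></PcGts>'
PREFIXED_XML = f'<pc:PcGts xmlns:pc="{NS}"><pc:Page imageFilename="a.jpg"/></pc:PcGts>'


def find_pages(xml_text):
    """Return all Page elements, matching on the local name only."""
    root = etree.fromstring(xml_text.encode("utf-8"))
    # local-name() ignores both the prefix and the namespace URI.
    return root.xpath(".//*[local-name()='Page']")


for doc in (DEFAULT_NS_XML, PREFIXED_XML):
    pages = find_pages(doc)
    print(len(pages), pages[0].get("imageFilename"))
```

Note that this matches a Page element in any namespace, which is more permissive than the original lookup.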
Maybe not the most elegant solution but I guess wildcarding namespaces might solve this.
So e.g. page = root.find(".//{*}Page")
instead of page = root.find(".//ns:Page", namespaces=ns)
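To illustrate the wildcard variant, here is a small sketch assuming lxml and two made-up sample documents. The helper name find_page is hypothetical.

```python
from lxml import etree

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical minimal documents: default namespace vs. explicit "pc" prefix.
DEFAULT_NS_XML = f'<PcGts xmlns="{NS}"><Page imageFilename="a.jpg"/></PcGts>'
PREFIXED_XML = f'<pc:PcGts xmlns:pc="{NS}"><pc:Page imageFilename="a.jpg"/></pc:PcGts>'


def find_page(xml_text):
    """Return the first Page element regardless of its namespace."""
    root = etree.fromstring(xml_text.encode("utf-8"))
    # "{*}" matches any namespace, so no namespaces= mapping is needed.
    return root.find(".//{*}Page")


for doc in (DEFAULT_NS_XML, PREFIXED_XML):
    print(find_page(doc).get("imageFilename"))
```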
Suggestion: ns = {"ns": root.nsmap[root.prefix]}
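A short sketch of how that lookup behaves for both output styles, assuming lxml. The sample documents and the helper name page_namespaces are made up for illustration.

```python
from lxml import etree

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical minimal documents: default namespace vs. explicit "pc" prefix.
DEFAULT_NS_XML = f'<PcGts xmlns="{NS}"><Page imageFilename="a.jpg"/></PcGts>'
PREFIXED_XML = f'<pc:PcGts xmlns:pc="{NS}"><pc:Page imageFilename="a.jpg"/></pc:PcGts>'


def page_namespaces(root):
    """Build the namespace mapping from whatever prefix the root element uses."""
    # root.prefix is None for a default namespace and "pc" for the prefixed file,
    # so the same lookup covers both cases.
    return {"ns": root.nsmap[root.prefix]}


for doc in (DEFAULT_NS_XML, PREFIXED_XML):
    root = etree.fromstring(doc.encode("utf-8"))
    ns = page_namespaces(root)
    page = root.find(".//ns:Page", namespaces=ns)
    print(root.prefix, ns["ns"], page.get("imageFilename"))
```

If the root element has no namespace at all, the dictionary lookup raises a KeyError, so a real implementation would probably want to handle that case.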
Dinglehopper does something similar to stay PAGE-version-agnostic: when parsing the document, it determines the namespace of the PcGts element and uses that for querying the XML. We cannot do this in OCR-D because we generate our PAGE API from the PAGE 2019 XSD, but if we could we would, since e.g. Transkribus uses an extended PAGE 2013 namespace and there is data with other PAGE namespaces around.
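The general idea of reading the namespace off the root element could look roughly like this with lxml. This is only a sketch under my own assumptions, not Dinglehopper's actual code; the sample documents and the helper name root_namespace are hypothetical.

```python
from lxml import etree

NS_2013 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS_2019 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical minimal documents in two PAGE versions and two prefix styles.
XML_2013 = f'<PcGts xmlns="{NS_2013}"><Page imageFilename="a.jpg"/></PcGts>'
XML_2019 = f'<pc:PcGts xmlns:pc="{NS_2019}"><pc:Page imageFilename="a.jpg"/></pc:PcGts>'


def root_namespace(root):
    """Return the namespace URI of the root element."""
    # QName splits the "{uri}PcGts" tag into namespace and local name.
    return etree.QName(root).namespace


for doc in (XML_2013, XML_2019):
    root = etree.fromstring(doc.encode("utf-8"))
    ns = {"ns": root_namespace(root)}
    page = root.find(".//ns:Page", namespaces=ns)
    print(ns["ns"], page.get("imageFilename"))
```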
Hi @alexander-winkler, could you please test if #259 works with Eynollah output? I don't have a file for testing at hand. Would also be interesting if lxml behaves correctly when writing predictions to PAGE XML with explicit prefixes.
When I try to run calamari-predict on an Eynollah XML I still get an error. I'm unable, however, to figure out if it's related to the issue here or to something else.
My call:
calamari-predict \
  --data PageXML \
  --data.text_index 1 \
  --data.images image.jpg \
  --data.xml_files image.xml \
  --checkpoint some_model.ckpt.json
The error I get:
CRITICAL 2021-06-12 15:30:53,892 tfaip.util.logging: Uncaught exception
Traceback (most recent call last):
  File "/home/user/virtualenvs/dev_calamari/bin/calamari-predict", line 33, in <module>
    sys.exit(load_entry_point('calamari-ocr==2.1.2', 'console_scripts', 'calamari-predict')())
  File "/home/user/virtualenvs/dev_calamari/lib/python3.8/site-packages/calamari_ocr-2.1.2-py3.8.egg/calamari_ocr/scripts/predict.py", line 188, in main
    run(args.root)
  File "/home/user/virtualenvs/dev_calamari/lib/python3.8/site-packages/calamari_ocr-2.1.2-py3.8.egg/calamari_ocr/scripts/predict.py", line 126, in run
    raise Exception("Empty dataset provided. Check your files argument (got {})!".format(args.files))
AttributeError: 'PredictArgs' object has no attribute 'files'
Here is a sample file: https://cloud.uni-halle.de/s/MCuYzWD2ABv464z
The error indicates that your file paths are wrong (no files found). The error message is buggy, though (check --data.images and --data.xml_files).
I'm not sure what the problem is:
My directory:
.
├── 16642027.jpg
└── 16642027.xml
My actual call:
calamari-predict --data PageXML --data.text_index 1 --data.images 16642027.jpg --data.xml_files 16642027.xml --checkpoint model.ckpt.json
According to your call, the image, XML, and model must be at the same location (no prefix dirs); however, your "directory" only shows two files... If this is true, you must provide a "relative" path to either the XML or the model files.
The model is located in a different directory. I've replaced it in my post above to make things simpler.
Okay, just wanted to "clarify" the most obvious mistake. Luckily, I receive the same error using your files. I will examine this and hopefully find the reason!
Okay, some updates on that: there are no ns:TextLine elements present in the XML file, only text regions, which is why Calamari does not find any content (to read). I guess the interesting regions are those with type="paragraph". @andbue, is it reasonable to also scan for text regions, or only for TextLine, which should be the correct container?
I don't think it would be a great idea to run calamari on anything other than text lines. Eynollah should also be able to do line segmentation and output TextLine elements if I'm not totally mistaken.
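For anyone hitting the same "Empty dataset" error, a quick way to check whether a PAGE file contains any text lines at all might look like this. It is a sketch assuming lxml; the sample documents and the helper name count_text_elements are made up and not part of Calamari.

```python
from lxml import etree

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical documents: one with regions only, one with a line inside a region.
REGIONS_ONLY_XML = (
    f'<pc:PcGts xmlns:pc="{NS}"><pc:Page>'
    '<pc:TextRegion id="r1" type="paragraph"/>'
    "</pc:Page></pc:PcGts>"
)
WITH_LINES_XML = (
    f'<pc:PcGts xmlns:pc="{NS}"><pc:Page>'
    '<pc:TextRegion id="r1" type="paragraph"><pc:TextLine id="l1"/></pc:TextRegion>'
    "</pc:Page></pc:PcGts>"
)


def count_text_elements(xml_text):
    """Count TextRegion and TextLine elements, ignoring the namespace."""
    root = etree.fromstring(xml_text.encode("utf-8"))
    return {
        name: len(root.findall(f".//{{*}}{name}"))
        for name in ("TextRegion", "TextLine")
    }


for doc in (REGIONS_ONLY_XML, WITH_LINES_XML):
    print(count_text_elements(doc))
```

A TextLine count of zero would explain why nothing is found to predict on.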
Maybe @alexander-winkler could test the prediction on Eynollah before we close this? Or we could simply reopen if something goes wrong with the prediction.
I have tried it (cf. post above) but ended up with an Eynollah problem. I'll give it another try, but Eynollah layout recognition takes some time, so bear with me.
Ok,
eynollah -i image.jpg -o output_dir -m path/to/models_eynollah -fl -cl
produces image.xml with prefixed TextRegions and TextLines, and the latter are OCR'ed properly with calamari-predict.
Thanks for testing, @alexander-winkler. I will close this and merge #259.