bagger creates invalid URL refs
bertsky opened this issue
We now have the OCR-D GT on GitHub (the old KIT repo has been down for a while, so this is the only place to get the data). It is created via @tboenig's gt-repo-template, which uses the bagger to create the BagIt zips.
Unfortunately, these bags are unusable:
Traceback (most recent call last):
File "/venv38/bin/ocrd-cis-ocropy-recognize", line 33, in <module>
sys.exit(load_entry_point('ocrd-cis', 'console_scripts', 'ocrd-cis-ocropy-recognize')())
File "/venv38/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/venv38/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/venv38/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/venv38/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/ocrd_cis/ocropy/cli.py", line 48, in ocrd_cis_ocropy_recognize
return ocrd_cli_wrap_processor(OcropyRecognize, *args, **kwargs)
File "/venv38/lib/python3.8/site-packages/ocrd/decorators/__init__.py", line 133, in ocrd_cli_wrap_processor
run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
File "/venv38/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 133, in run_processor
raise err
File "/venv38/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 130, in run_processor
processor.process()
File "/ocrd_cis/ocropy/recognize.py", line 162, in process
page_image, page_coords, _ = self.workspace.image_from_page(
File "/venv38/lib/python3.8/site-packages/ocrd/workspace.py", line 636, in image_from_page
page_image_info = self.resolve_image_exif(page.imageFilename)
File "/venv38/lib/python3.8/site-packages/ocrd/workspace.py", line 463, in resolve_image_exif
with download_temporary_file(image_url) as f:
File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/venv38/lib/python3.8/site-packages/ocrd/workspace.py", line 53, in download_temporary_file
with requests.get(url) as r:
File "/venv38/lib/python3.8/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/venv38/lib/python3.8/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/venv38/lib/python3.8/site-packages/requests/sessions.py", line 573, in request
prep = self.prepare_request(req)
File "/venv38/lib/python3.8/site-packages/requests/sessions.py", line 484, in prepare_request
p.prepare(
File "/venv38/lib/python3.8/site-packages/requests/models.py", line 368, in prepare
self.prepare_url(url, params)
File "/venv38/lib/python3.8/site-packages/requests/models.py", line 439, in prepare_url
raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL 'blumenbach_anatomie_1805_0047.tif': No scheme supplied. Perhaps you meant http://blumenbach_anatomie_1805_0047.tif?
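The last frame can be illustrated without any network access: a bare file name carries no URL scheme, so it cannot be fetched as a URL. A minimal sketch using only the standard library; the helper name `has_url_scheme` and the `example.org` URL are placeholders for illustration, not part of OCR-D core or requests:

```python
from urllib.parse import urlsplit


def has_url_scheme(ref: str) -> bool:
    """Return True if ref starts with a scheme that a downloader could handle."""
    return urlsplit(ref).scheme in ("http", "https", "file")


refs = [
    "blumenbach_anatomie_1805_0047.tif",       # PAGE @imageFilename from the bag
    "OCR-D-IMG/OCR-D-IMG_0001.tif",            # local FLocat href from the METS
    "https://example.org/OCR-D-IMG_0001.tif",  # placeholder remote URL
]
for ref in refs:
    # Refs without a scheme must be resolved as local paths, not downloaded.
    print(ref, has_url_scheme(ref))
```

Only the third reference would be a candidate for downloading; the first two have to be resolved against the workspace directory or the METS.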
Here is the relevant part of the respective mets.xml:
<mets:fileGrp USE="OCR-D-GT-SEG-PAGE">
<mets:file MIMETYPE="application/vnd.prima.page+xml" ID="OCR-D-GT-SEG-PAGE_0001">
<mets:FLocat LOCTYPE="URL" xlink:href="GT-PAGE/blumenbach_anatomie_1805_0047.xml"/>
<mets:FLocat xlink:href="OCR-D-GT-SEG-PAGE/OCR-D-GT-SEG-PAGE_0001.xml" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
</mets:file>
...
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-IMG">
<mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
<mets:FLocat LOCTYPE="URL" xlink:href="GT-PAGE/blumenbach_anatomie_1805_0047.tif"/>
<mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG_0001.tif" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
</mets:file>
...
</mets:fileGrp>
And that's what the PAGE XML looks like:
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi=
"http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15
http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="pc-blumenbach_anatomie_1805_0047">
<Metadata>
<Creator>OCR-D</Creator>
<Created>2016-09-20T09:27:46</Created>
<LastChange>2018-04-24T13:53:18</LastChange>
<Page imageFilename="blumenbach_anatomie_1805_0047.tif" imageWidth="1700" imageHeight="2687" type="content">
...
To summarize:
- the METS has spurious URL refs that look like local paths (which never existed in the first place).
- the correct FLocat/@href entries in the METS cannot be resolved, because they do not match the PAGE's @imageFilename.
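The mismatch can be checked mechanically. The following is a rough sketch, not the check OCR-D core performs: it collects every FLocat href from a METS document and reports which PAGE image file names match none of them. The function names are hypothetical, and the enclosing `mets:mets`/`mets:fileSec` elements are added here only to make the excerpt well-formed.

```python
import xml.etree.ElementTree as ET

METS_FLOCAT = "{http://www.loc.gov/METS/}FLocat"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Excerpt from the issue, wrapped in assumed root elements so it parses.
METS_SNIPPET = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="GT-PAGE/blumenbach_anatomie_1805_0047.tif"/>
        <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG_0001.tif" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def mets_hrefs(mets_xml: str) -> set:
    """Collect all FLocat/@xlink:href values found in the METS."""
    root = ET.fromstring(mets_xml)
    return {flocat.get(XLINK_HREF) for flocat in root.iter(METS_FLOCAT)}


def unresolved_image_refs(mets_xml: str, image_filenames: list) -> list:
    """Return the PAGE image file names that match no FLocat href exactly."""
    known = mets_hrefs(mets_xml)
    return [name for name in image_filenames if name not in known]


print(unresolved_image_refs(METS_SNIPPET, ["blumenbach_anatomie_1805_0047.tif"]))
```

With the excerpt above, the PAGE's `blumenbach_anatomie_1805_0047.tif` is reported as unresolved, because the METS only knows `GT-PAGE/blumenbach_anatomie_1805_0047.tif` and `OCR-D-IMG/OCR-D-IMG_0001.tif`.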
This is urgent: some of our module CIs depend on GT datasets (i.e. they must be adapted to the new bag URLs now), but the bags don't work.
Probably related: #1149
Seems to have been fixed by @tboenig with the newest releases. So was this a problem with the gt-repo-scripts alone (and we can close here), or is there still something wrong with the bagger?
The problem is back with all the released bags after and including https://github.com/OCR-D/gt_structure_text/releases/tag/v1.4.3.
The problem is not with the bagger, though. The culprit is this change in the PAGE files: with @imageFilename replaced in that way, the bagger's imageFilename substitution rule can no longer fire.
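To illustrate why a substitution rule of this kind stops firing, here is a deliberately simplified sketch. It assumes the rule is an exact string match between the PAGE @imageFilename and an href recorded in the METS; the actual logic in workspace_bagger.py may differ, and `rewrite_image_filename` and `href_map` are hypothetical names.

```python
def rewrite_image_filename(image_filename: str, href_map: dict) -> str:
    """Return the bag-local path for a known old href, else the input unchanged.

    Assumption: substitution only happens on an exact match.
    """
    return href_map.get(image_filename, image_filename)


# Old METS href -> path of the file inside the bag (values taken from the issue).
href_map = {
    "GT-PAGE/blumenbach_anatomie_1805_0047.tif": "OCR-D-IMG/OCR-D-IMG_0001.tif",
}

# Matches an old href exactly, so the rule fires.
print(rewrite_image_filename("GT-PAGE/blumenbach_anatomie_1805_0047.tif", href_map))

# A bare file name matches nothing, so it passes through unchanged
# and later fails to resolve.
print(rewrite_image_filename("blumenbach_anatomie_1805_0047.tif", href_map))
```

Under that assumption, any rewrite of @imageFilename in the GT repo that no longer equals a METS href silently disables the substitution.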
But even if the files are repaired in the GT data repo: IMO the bagger really needs to address this:
core/src/ocrd/workspace_bagger.py, line 104 in 3a69e65
(We do rely on derived images like binarization in some CI tests.)