OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page:https://ocr-d.de/core/

bagger creates invalid URL refs

bertsky opened this issue

We now have OCR-D GT on GitHub (the old KIT repo has been down for a while, so this is the only place to get the data). The bags get created via @tboenig's gt-repo-template, which uses the bagger to create the BagIt zips.

Unfortunately, these bags are unusable:

Traceback (most recent call last):
  File "/venv38/bin/ocrd-cis-ocropy-recognize", line 33, in <module>
    sys.exit(load_entry_point('ocrd-cis', 'console_scripts', 'ocrd-cis-ocropy-recognize')())
  File "/venv38/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/venv38/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/venv38/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/venv38/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/ocrd_cis/ocropy/cli.py", line 48, in ocrd_cis_ocropy_recognize
    return ocrd_cli_wrap_processor(OcropyRecognize, *args, **kwargs)
  File "/venv38/lib/python3.8/site-packages/ocrd/decorators/__init__.py", line 133, in ocrd_cli_wrap_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/venv38/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 133, in run_processor
    raise err
  File "/venv38/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 130, in run_processor
    processor.process()
  File "/ocrd_cis/ocropy/recognize.py", line 162, in process
    page_image, page_coords, _ = self.workspace.image_from_page(
  File "/venv38/lib/python3.8/site-packages/ocrd/workspace.py", line 636, in image_from_page
    page_image_info = self.resolve_image_exif(page.imageFilename)
  File "/venv38/lib/python3.8/site-packages/ocrd/workspace.py", line 463, in resolve_image_exif
    with download_temporary_file(image_url) as f:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/venv38/lib/python3.8/site-packages/ocrd/workspace.py", line 53, in download_temporary_file
    with requests.get(url) as r:
  File "/venv38/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/venv38/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/venv38/lib/python3.8/site-packages/requests/sessions.py", line 573, in request
    prep = self.prepare_request(req)
  File "/venv38/lib/python3.8/site-packages/requests/sessions.py", line 484, in prepare_request
    p.prepare(
  File "/venv38/lib/python3.8/site-packages/requests/models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "/venv38/lib/python3.8/site-packages/requests/models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL 'blumenbach_anatomie_1805_0047.tif': No scheme supplied. Perhaps you meant http://blumenbach_anatomie_1805_0047.tif?
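The failure at the bottom of the traceback comes down to the reference having no URL scheme, so it gets handed to `requests` as if it were remote. A minimal sketch of that check using only the standard library (the helper name `has_url_scheme` is made up for illustration and is not part of ocrd or requests):

```python
from urllib.parse import urlparse


def has_url_scheme(ref):
    """Return True if ref carries an explicit http(s) scheme."""
    return urlparse(ref).scheme in ("http", "https")


# The value from PAGE @imageFilename in the traceback above
page_ref = "blumenbach_anatomie_1805_0047.tif"
# A real remote URL for comparison
remote_ref = "https://github.com/OCR-D/gt_structure_text/releases/tag/v1.4.3"

print(has_url_scheme(page_ref))    # scheme-less, so not downloadable as-is
print(has_url_scheme(remote_ref))
```

A bare file name like this can only work if it resolves to a local file in the workspace first, which is exactly what does not happen here.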

Here's from the respective mets.xml:

      <mets:fileGrp USE="OCR-D-GT-SEG-PAGE">
         <mets:file MIMETYPE="application/vnd.prima.page+xml" ID="OCR-D-GT-SEG-PAGE_0001">
            <mets:FLocat LOCTYPE="URL" xlink:href="GT-PAGE/blumenbach_anatomie_1805_0047.xml"/>
            <mets:FLocat xlink:href="OCR-D-GT-SEG-PAGE/OCR-D-GT-SEG-PAGE_0001.xml" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
         </mets:file>
...
      </mets:fileGrp>
      <mets:fileGrp USE="OCR-D-IMG">
         <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
            <mets:FLocat LOCTYPE="URL" xlink:href="GT-PAGE/blumenbach_anatomie_1805_0047.tif"/>
            <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG_0001.tif" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
         </mets:file>
...
      </mets:fileGrp>

And that's what the PAGE XML looks like:

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd"
       pcGtsId="pc-blumenbach_anatomie_1805_0047">
        <Metadata>
        <Creator>OCR-D</Creator>
        <Created>2016-09-20T09:27:46</Created>
        <LastChange>2018-04-24T13:53:18</LastChange>
       <Page imageFilename="blumenbach_anatomie_1805_0047.tif" imageWidth="1700" imageHeight="2687" type="content">
...

To summarize:

  1. The METS has spurious URL refs that look like local paths (but these paths never existed in the first place).
  2. The correct FLocat/@href entries in the METS cannot be used to resolve the image, because none of them matches the PAGE's @imageFilename.
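The mismatch can be reproduced with a few lines of standard-library Python. This is only a sketch over trimmed copies of the snippets quoted above (wrapped in minimal root elements so they parse); the helper names `mets_hrefs` and `page_image_filename` are made up for illustration and are not ocrd API:

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
    "pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15",
}

# Trimmed copy of the OCR-D-IMG fileGrp from the mets.xml above
METS_SNIPPET = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                             xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileGrp USE="OCR-D-IMG">
    <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
      <mets:FLocat LOCTYPE="URL" xlink:href="GT-PAGE/blumenbach_anatomie_1805_0047.tif"/>
      <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG_0001.tif" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
    </mets:file>
  </mets:fileGrp>
</mets:mets>"""

# Trimmed copy of the PAGE file above (Metadata omitted)
PAGE_SNIPPET = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="blumenbach_anatomie_1805_0047.tif" imageWidth="1700" imageHeight="2687" type="content"/>
</PcGts>"""


def mets_hrefs(mets_xml):
    """Collect every FLocat/@xlink:href found in the METS."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]
    return {flocat.get(href_attr) for flocat in root.iterfind(".//mets:FLocat", NS)}


def page_image_filename(page_xml):
    """Return Page/@imageFilename from a PAGE document."""
    root = ET.fromstring(page_xml)
    return root.find("pc:Page", NS).get("imageFilename")


hrefs = mets_hrefs(METS_SNIPPET)
image_filename = page_image_filename(PAGE_SNIPPET)
# True only if the PAGE reference matches some METS file location
print(image_filename in hrefs)
```

Running a check like this over every PAGE file in a bag before release would catch the problem without needing a full processor run.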

This is urgent: some of our module CIs depend on GT datasets (i.e. they must be adapted to the new bag URLs now), but the new bags don't work.

Probably related: #1149

Seems to have been fixed by @tboenig with the newest releases. So was this a problem with the gt-repo-scripts alone (and we can close here), or is there still something wrong with the bagger?

The problem is back with all the released bags after and including https://github.com/OCR-D/gt_structure_text/releases/tag/v1.4.3.

The problem is not with the bagger though: The culprit is this change in the PAGE files. By replacing the @imageFilename in that way, the bagger's imageFilename substitution rule cannot fire anymore.
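To illustrate why such a rule stops firing, here is a minimal sketch. It assumes (hypothetically, this is not confirmed in this thread) that the substitution is an exact-match lookup keyed by the old METS reference, and that the earlier PAGE files carried that same relative path. `substitute_image_filename` and `renamed` are illustrative names, not the bagger's actual API:

```python
def substitute_image_filename(image_filename, renamed):
    """Return the new bag-relative path if the old reference is known,
    otherwise return the input unchanged (the rule does not fire)."""
    return renamed.get(image_filename, image_filename)


# Assumed mapping from old METS reference to the path inside the bag
renamed = {
    "GT-PAGE/blumenbach_anatomie_1805_0047.tif": "OCR-D-IMG/OCR-D-IMG_0001.tif",
}

# Assumed old form of @imageFilename: matches a key, gets rewritten
old_style = substitute_image_filename("GT-PAGE/blumenbach_anatomie_1805_0047.tif", renamed)
# Form seen in the PAGE file above: no key matches, stays as-is
new_style = substitute_image_filename("blumenbach_anatomie_1805_0047.tif", renamed)

print(old_style)
print(new_style)
```

With an exact-match rule, any change to how @imageFilename is written in the GT repo silently leaves a dangling reference in the bag.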

But even if the files are repaired in the GT data repo: IMO the bagger really needs to address this:

# TODO replace AlternativeImage, recursively...

(We do rely on derived images like binarization in some CI tests.)
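One possible shape for that TODO, as a sketch only: walk the whole PAGE tree and rewrite both Page/@imageFilename and every AlternativeImage/@filename at any nesting level. The function name `rewrite_image_refs`, the `renamed` mapping and all file names in the sample are hypothetical; the real bagger works on ocrd's generated PAGE classes rather than on ElementTree:

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical PAGE document with derived images on page and region level
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="old/page.tif">
    <AlternativeImage filename="old/page.bin.png" comments="binarized"/>
    <TextRegion id="r1">
      <AlternativeImage filename="old/r1.bin.png" comments="binarized"/>
      <Coords points="0,0 1,0 1,1 0,1"/>
    </TextRegion>
  </Page>
</PcGts>"""


def rewrite_image_refs(page_root, renamed):
    """Rewrite image references in place; return how many were changed."""
    # Which attribute holds the image path on which element
    attr_for_tag = {
        "{%s}Page" % PAGE_NS: "imageFilename",
        "{%s}AlternativeImage" % PAGE_NS: "filename",
    }
    changed = 0
    # iter() visits all descendants, so nested AlternativeImage are covered
    for elem in page_root.iter():
        attr = attr_for_tag.get(elem.tag)
        if attr is None:
            continue
        old = elem.get(attr)
        if old in renamed:
            elem.set(attr, renamed[old])
            changed += 1
    return changed


# Hypothetical mapping from old references to paths inside the bag
renamed = {
    "old/page.tif": "OCR-D-IMG/OCR-D-IMG_0001.tif",
    "old/page.bin.png": "OCR-D-BIN/OCR-D-BIN_0001.png",
    "old/r1.bin.png": "OCR-D-BIN/OCR-D-BIN_0001_r1.png",
}

root = ET.fromstring(SAMPLE)
count = rewrite_image_refs(root, renamed)
refs = [e.get("filename") for e in root.iter("{%s}AlternativeImage" % PAGE_NS)]
print(count, refs)
```

References that are not in the mapping are left untouched, so a caller could compare the returned count against the number of image references and flag any that remain unresolved.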