OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

Invalid structMap produced

mikegerber opened this issue

This might be a bug in ocrd-cis actually, so beware.

We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce it with the latest ocrd/all:maximum Docker image, using the following steps:

  1. Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
  2. I removed all fileGrps except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
  3. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN. → After this, the structMap is INVALID

Invalid structMap (multiple divs with the same ID) after step 3, shortened to the affected physical page:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>

       ...

      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
  
       ...

    </mets:div>
  </mets:structMap>
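
For quick triage, duplicate page div IDs like the one above can be detected with a few lines of lxml. This is an ad-hoc helper of mine, not part of core; it only assumes the standard METS namespace:

from collections import Counter
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

def duplicate_page_div_ids(mets_path):
    # collect all page div IDs from the physical structMap ...
    ids = etree.parse(mets_path).xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]/@ID',
        namespaces=NS)
    # ... and report those occurring more than once
    return sorted(i for i, n in Counter(ids).items() if n > 1)

print(duplicate_page_div_ids('mets.xml'))
# for the METS above: ['P_1879_45_0344']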

(I'll upload the full data in the comments)

This causes all kinds of breakage all over the place.

What I haven't checked yet: whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. I first had the impression that it breaks with add too, but since I had been trying to reproduce a problem @stweil encountered in OCR-D/quiver-benchmarks#22 (the specific workflow there uses ocrd_cis as its first step), it could have been ocrd_cis all along and I could easily have confused something.

A much simpler way to reproduce:

# enable METS caching (the trigger for the bug)
export OCRD_METS_CACHING=1
# fetch a small test workspace
git clone https://github.com/OCR-D/assets
cd assets/data/SBB0000F29300010000/data
# run any processor that adds files to the METS
ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
# look up the METS XSD shipped with core and validate the result
eval declare -A XSD_PATHS=($(ocrd bashlib constants XSD_PATHS))
XSD_METS=${XSD_PATHS[$(ocrd bashlib constants XSD_METS_URL)]}
xmllint --schema $XSD_METS --noout mets.xml
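
The schema check can also be done from Python with lxml, in case xmllint is not at hand (a sketch; 'mets.xsd' stands for the local schema file that XSD_PATHS resolves to above):

from lxml import etree

# 'mets.xsd' is a placeholder for the local METS schema from XSD_PATHS
schema = etree.XMLSchema(etree.parse('mets.xsd'))
doc = etree.parse('mets.xml')
if not schema.validate(doc):
    for error in schema.error_log:
        print(error)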

The culprit is the caching.

Alright, so it's not ocrd-cis, but definitely a bug in core, and a severe one!

> (I'll upload the full data in the comments)

https://qurator-data.de/~mike.gerber/2024-02-core-issue-1195-invalid-structMap/

reichsanzeiger_random_selected_pages_ocr.2-after-ocrd-cis-ocropy-binarize.zip is the one with the broken METS, after step 3 above.

Absolutely.

And #1193 does not help (yet). It must be the changes to OcrdMets around #1063, though.

> It must be the changes to OcrdMets around #1063, though.

Confirmed by bisection. Somewhere here we broke the cache validity.

The workaround when running the ocrd/all:maximum image is to disable the caching by setting OCRD_METS_CACHING=0:

$ env | grep OCRD_METS_CACHING
OCRD_METS_CACHING=0

e.g. by using the Docker CLI's --env option like this:

$ cat run-ocrd_all-maximum.sh
#!/bin/sh
docker run \
        --user $(id -u) --workdir /data --volume "$PWD":/data \
        --volume ocrd-models:/models \
        --env OCRD_METS_CACHING=0 \
        --rm -it \
        ocrd/all:maximum bash

(I'm currently trying to use the official image, and this script is what I am using for the time being.)

> 3. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN. → After this, the structMap is INVALID

Setting OCRD_METS_CACHING=0 fixes this; the structMap is valid after this binarization run.

Note: As I understand it, this only happens with the ocrd/all Docker images, because they enable caching, but (unconfirmed) not with a bare, un-containerized ocrd_all installation.

I found the cause: in this line …

self._fptr_cache[pageId] = {}

… the fptr cache entry gets overwritten with an empty dict whenever the page cache had no entry for that pageId yet (instead of using a setdefault), discarding any fptrs already cached for that page.
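
The difference, spelled out on plain dicts (a sketch; that the cache maps file IDs to fptr elements per page is my assumption about the internals):

# the METS caches are plain dicts; compare the two patterns:
fptr_cache = {'PAGE1': {'f1': '<fptr for OCR-D-IMG>'}}

# buggy pattern (the quoted line): resets the entry unconditionally,
# discarding the fptr already cached for this page
fptr_cache['PAGE1'] = {}
assert fptr_cache['PAGE1'] == {}

# correct pattern: setdefault only creates the entry if it is missing
fptr_cache = {'PAGE1': {'f1': '<fptr for OCR-D-IMG>'}}
fptr_cache.setdefault('PAGE1', {})
assert fptr_cache['PAGE1'] == {'f1': '<fptr for OCR-D-IMG>'}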

Note: we should also add test coverage for this. AFAICS, we need to test a cached add_file in the situation where a file for that page already exists, but in a different fileGrp.
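
A hedged sketch of such a test (the test name and the GRP1/GRP2/PAGE1 values are made up; it assumes that OcrdMets.empty_mets accepts a cache_flag keyword to enable the caches and that _tree exposes the underlying lxml tree):

from ocrd_models import OcrdMets

NS = {'mets': 'http://www.loc.gov/METS/'}

def test_cached_add_file_existing_page_other_filegrp():
    # empty METS with caching enabled (cf. OCRD_METS_CACHING)
    mets = OcrdMets.empty_mets(cache_flag=True)
    # first file creates the physical page div
    mets.add_file('GRP1', ID='f1', mimetype='image/png', pageId='PAGE1')
    # second file, same page but different fileGrp: must reuse that div
    mets.add_file('GRP2', ID='f2', mimetype='image/png', pageId='PAGE1')
    # inspect the structMap: exactly one div for the page, with both fptrs
    divs = mets._tree.getroot().xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]',
        namespaces=NS)
    assert len(divs) == 1
    assert len(divs[0].findall('mets:fptr', NS)) == 2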

> Note: we should also add test coverage for this. AFAICS, we need to test a cached add_file in the situation where a file for that page already exists, but in a different fileGrp.

Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?

> Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?

Yes, OCR-D/ocrd_all#407

> > Note: we should also add test coverage for this. AFAICS, we need to test a cached add_file in the situation where a file for that page already exists, but in a different fileGrp.
>
> > Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?
>
> Yes, OCR-D/ocrd_all#407

Does this test a simple workflow? I really don't see it.

Note: as long as no v2.62.3 is released on PyPI, the bug is still not fixed for users (ocrd/all being affected).