Invalid structMap produced
mikegerber opened this issue
This might be a bug in ocrd-cis actually, so beware.
We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce it with the latest ocrd/all:maximum Docker image, using the following steps:
- Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
- I removed all file groups except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
- Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN. → After this, the structMap is INVALID.
Invalid structMap (multiple divs with the same ID, which violates the schema's xs:ID uniqueness) after step 2, shortened to one physical page for emphasis:
<mets:structMap TYPE="PHYSICAL">
  <mets:div TYPE="physSequence">
    <mets:div TYPE="page" ID="P_1879_45_0344">
      <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
      <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
    </mets:div>
    ...
    <mets:div TYPE="page" ID="P_1879_45_0344">
      <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
    </mets:div>
    ...
  </mets:div>
</mets:structMap>
(I'll upload the full data in the comments)
This causes all kinds of breakage all over the place.
What I didn't check yet: whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. I first had the impression that this breaks with add too, but since I had been trying to reproduce a problem encountered by @stweil in OCR-D/quiver-benchmarks#22 (that specific workflow uses ocrd_cis as its first step), it could have been ocrd_cis all along, and I could easily have confused something.
Much simpler way to reproduce:
export OCRD_METS_CACHING=1
git clone https://github.com/OCR-D/assets
cd assets/data/SBB0000F29300010000/data
ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
eval declare -A XSD_PATHS=($(ocrd bashlib constants XSD_PATHS))
XSD_METS=${XSD_PATHS[$(ocrd bashlib constants XSD_METS_URL)]}
xmllint --schema $XSD_METS --noout mets.xml
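The same check can be scripted in Python with lxml; a minimal sketch, assuming the METS XSD resolved above is available locally as mets.xsd (path is an assumption):

from lxml import etree

# Validate mets.xml against the METS schema; the duplicate page divs
# violate xs:ID uniqueness, so validation fails with ID errors.
schema = etree.XMLSchema(etree.parse('mets.xsd'))  # path: assumption
doc = etree.parse('mets.xml')
if schema.validate(doc):
    print('mets.xml validates')
else:
    for error in schema.error_log:
        print(error)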
The culprit is the caching.
Alright, so it's not ocrd-cis, but definitely a bug in core and a severe one!
As promised, I uploaded the full data:
https://qurator-data.de/~mike.gerber/2024-02-core-issue-1195-invalid-structMap/
reichsanzeiger_random_selected_pages_ocr.2-after-ocrd-cis-ocropy-binarize.zip is the one with the broken METS, after step 2 above.
Workaround when running the ocrd/all:maximum image: disable the caching by setting OCRD_METS_CACHING=0:

$ env | grep OCRD_METS_CACHING
OCRD_METS_CACHING=0

e.g. by using the docker CLI's --env option, like this:
$ cat run-ocrd_all-maximum.sh
#!/bin/sh
docker run \
--user $(id -u) --workdir /data --volume $PWD:/data \
--volume ocrd-models:/models \
--env OCRD_METS_CACHING=0 \
--rm -it \
ocrd/all:maximum bash
(I'm currently trying to use the official image, and this shell script is what I am using for the time being.)
Setting OCRD_METS_CACHING=0 fixes this: the structMap is valid after rerunning the ocrd-cis-ocropy-binarize step from above.
Note: As I understand it, this only happens with the ocrd/all Docker images, because they enable caching, and (unconfirmed) not when running ocrd_all natively, un-containerized.
I found the cause: this line in core/src/ocrd_models/ocrd_mets.py (line 779 at commit 3a1b3a2) uses setdefault. See new #1193.
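For illustration (this is not the actual core code, just the generic Python pitfall): dict.setdefault always evaluates its default argument eagerly, so a page cache that creates the div inside the setdefault call attaches a duplicate div to the tree even on a cache hit. A minimal sketch:

from lxml import etree as ET

NS = 'http://www.loc.gov/METS/'
seq = ET.Element(f'{{{NS}}}div', TYPE='physSequence')
page_cache = {}

def get_page_div(page_id):
    # BUG: the ET.SubElement is created (and attached!) before setdefault
    # checks whether page_id is already in the cache.
    return page_cache.setdefault(
        page_id, ET.SubElement(seq, f'{{{NS}}}div', TYPE='page', ID=page_id))

get_page_div('P_0001')
get_page_div('P_0001')  # cache hit, yet a second div was attached
print(len(seq))  # 2, i.e. two divs with ID="P_0001"

The fix is to test membership first (if page_id not in page_cache: ...) so the element is only created on a genuine cache miss.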
Note: we should also add test coverage for this. AFAICS, we need a cached add_file under the circumstance that a file for that page already exists, but in a different fileGrp; see the sketch below.
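A rough sketch of such a regression test (assuming the OcrdMets API of recent core, i.e. empty_mets() with a cache_flag parameter, add_file, and to_xml; IDs and paths here are illustrative):

from ocrd_models import OcrdMets

def test_cached_add_file_existing_page_other_filegrp():
    # Build an in-memory METS with caching enabled (cache_flag: assumption)
    mets = OcrdMets.empty_mets(cache_flag=True)
    # First file creates the page div for PHYS_0001
    mets.add_file('OCR-D-IMG', ID='OCR-D-IMG_0001', pageId='PHYS_0001',
                  mimetype='image/tiff', url='img/0001.tif')
    # Second file, same page but different fileGrp, must reuse that div
    mets.add_file('OCR-D-BIN', ID='OCR-D-BIN_0001', pageId='PHYS_0001',
                  mimetype='image/png', url='bin/0001.png')
    # Exactly one page div with this ID may exist in the structMap
    assert mets.to_xml().count(b'ID="PHYS_0001"') == 1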
Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?
Yes, OCR-D/ocrd_all#407
Does OCR-D/ocrd_all#407 test a simple workflow? I really don't see it.
Note: as long as no v2.62.3 is released on PyPI, the bug is still not solved for users (ocrd/all being affected).