OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

`ocrd workspace add`: ORDER-attribute is not set during writing/updating

csidirop opened this issue · comments

When updating a METS file that (for whatever reason) lacks the structMap PHYSICAL entry for the page, the ORDER attribute is not set on the newly created entry.

The ORDER attribute is mandatory, isn't it?

In addition, please consider whether it would make sense to expose ORDER as an additional parameter of the command.

Can you share the workspace and sequence of calls that lead to this issue? For the most part, the METS handling in core should not change what was input, except for newly generated elements.

For example, in https://content.staatsbibliothek-berlin.de/dc/PPN631277528.mets.xml, @ORDER and @ORDERLABEL are present on the mets:div and are preserved when serializing.

I'll be happy to add better support for @ORDER once I understand what goes wrong.

Thanks for your response! I'm adding single ALTOs to an existing METS-file.

I'm calling `ocrd --log-level INFO workspace add --force --file-grp FULLTEXT --file-id "fulltext-$pageId" --page-id="$pageNum" --mimetype text/xml "$pageId.xml"`. This creates a <mets:fileGrp USE="FULLTEXT"> node. It also creates a new structMap PHYSICAL entry, <mets:div TYPE="page" ID="XXX">, and leaves the original entry untouched.

If the ID is already known, the existing `structMap` entry is updated instead. So this is probably only a problem if you use the command in a script or process many METS files automatically.
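To make the symptom easier to check for, here is a minimal sketch that lists page-level mets:div elements in the physical structMap that have no @ORDER. It uses only the Python standard library, not OCR-D/core itself; the sample METS snippet, the page IDs and the helper name `pages_missing_order` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily reduced METS: one original page entry with @ORDER
# and one entry as it might look after being newly created without it.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001" ORDER="1" ORDERLABEL="1"/>
      <mets:div TYPE="page" ID="PHYS_0002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pages_missing_order(root):
    """Return the IDs of page-level mets:div elements without @ORDER."""
    xpath = "./mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    return [
        div.get("ID")
        for div in root.findall(xpath, NS)
        if div.get("ORDER") is None
    ]


root = ET.fromstring(SAMPLE_METS)
print(pages_missing_order(root))  # prints ['PHYS_0002']
```

Running a check like this after a batch of `ocrd workspace add` calls would show which pages still need their attributes set.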

Thanks, I understand now. We need a way to add @ORDER, @ORDERLABEL and @TYPE (any other attributes?) to a newly created page-specific mets:div.

Here are two possible ways to add that:

  1. ocrd workspace update-page --order ... --order-label ... --type ... $PAGE_ID which changes/adds the attributes to the mets:div with @ID $PAGE_ID
  2. Add these attributes to ocrd workspace add/bulk-add.

Solution 1 is not too complicated to implement. If it is enough for your use case, I would prefer to leave it at that: you add your files with ocrd workspace add and then call ocrd workspace update-page once to set the attributes.
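As a rough illustration of what the second step of that workflow amounts to at the XML level, the sketch below sets @ORDER and @ORDERLABEL on the mets:div with a given @ID. It is a standard-library stand-in under stated assumptions, not the implementation in OCR-D/core; the helper name `set_page_attributes`, the sample METS and the page IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, reduced METS with a page entry that lacks @ORDER/@ORDERLABEL.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001" ORDER="1" ORDERLABEL="1"/>
      <mets:div TYPE="page" ID="PHYS_0002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def set_page_attributes(root, page_id, order=None, orderlabel=None):
    """Set @ORDER and/or @ORDERLABEL on the page div with the given @ID."""
    xpath = "./mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for div in root.findall(xpath, NS):
        if div.get("ID") == page_id:
            # Only touch attributes that were explicitly passed.
            if order is not None:
                div.set("ORDER", str(order))
            if orderlabel is not None:
                div.set("ORDERLABEL", str(orderlabel))
            return div
    raise KeyError(f"no page div with ID {page_id!r}")


root = ET.fromstring(SAMPLE_METS)
page = set_page_attributes(root, "PHYS_0002", order=2, orderlabel="2")
print(page.get("ORDER"), page.get("ORDERLABEL"))
```

@TYPE is deliberately left alone here, since the page divs are expected to keep TYPE="page".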

If it is important for your use case to be able to do it as part of ocrd workspace add/bulk-add, then we'll need to find a way to make this as unobtrusive as possible because both add and bulk-add already have a pretty convoluted CLI and API.

I just realized that OcrdMets.set_physical_page_for_file already supports @ORDER and @ORDERLABEL (@TYPE is always page), we just need to expose it to the CLI.

Great, solution 1 should be sufficient for me!

But I think the second solution is better in the long run. I see that it is not so easy, e.g. if multiple attributes are set, they could in the worst case match two different entries.
On the other hand, that way one could update the fileSec entry using only the page number (ORDER).
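The ambiguity concern can be made concrete with a small sketch: looking up a page by @ORDER alone only works if exactly one mets:div matches. The code below is an illustration using the standard library, not part of OCR-D/core; `find_page_id_by_order`, the sample METS and the page IDs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical METS in which two page entries share the same @ORDER.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001" ORDER="1"/>
      <mets:div TYPE="page" ID="PHYS_0002" ORDER="2"/>
      <mets:div TYPE="page" ID="PHYS_0002_DUP" ORDER="2"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def find_page_id_by_order(root, order):
    """Return the @ID of the single page div with this @ORDER.

    Raises LookupError if no entry or more than one entry matches.
    """
    xpath = "./mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    matches = [
        div.get("ID")
        for div in root.findall(xpath, NS)
        if div.get("ORDER") == str(order)
    ]
    if not matches:
        raise LookupError(f"no page with ORDER={order}")
    if len(matches) > 1:
        raise LookupError(f"ORDER={order} is ambiguous: {matches}")
    return matches[0]


root = ET.fromstring(SAMPLE_METS)
print(find_page_id_by_order(root, 1))  # prints PHYS_0001
```

Refusing to guess when several entries match is one possible design; a CLI could equally report all matches and ask the user to pick an @ID.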

> I just realized that OcrdMets.set_physical_page_for_file already supports @ORDER and @ORDERLABEL (@TYPE is always page), we just need to expose it to the CLI.

That's great news!

Solution 1 is in #1134, please let me know if it works for you. It currently supports ORDER, ORDERLABEL and CONTENTIDS. Changing TYPE is possible but would break the data model, so I did not implement that one. If there are more attributes you want to be able to change, let me know.