OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page:https://ocr-d.de/core/

Suggestion: support split page ranges

MehmedGIT opened this issue

The OCR-D processors accept a page_id parameter, which makes it possible to run a processor on a specific page, several pages, or a range of pages. However, some workspaces have inconsistent page IDs, which makes ranges hard to express. Consider the page IDs in the following 2 workspaces:

  1. workspace: PHYS_0001, PHYS_0002, ..., PHYS_XXXX - the ideal case, where the page IDs are consistent and sequential, and where a simple PHYS_YYYY..PHYS_XXXX expression is enough to select a range.
  2. workspace: Phys60047, Phys60048, Phys60050 - a somewhat troublesome case, where the IDs are not sequential although the pages are.

The physical page ID format differs from workspace to workspace. A pattern that works in general is therefore hard to come up with, and every implementer has to solve the problem separately.

Of course, the user could still get the list of physical pages manually (with ocrd workspace list-page) and divide it into ranges themselves. However, it would be more convenient if the CLI supported this directly.

Suggestions:

  1. ocrd workspace list-page -R 51..100 should return only the physical pages at positions 51 to 100, regardless of their page_id format (-R for range). The output should be in a format that the OCR-D processors accept for the page_id parameter.
  2. ocrd workspace list-page -D 4 -C 2 should divide the workspace into 4 equal parts and return the 2nd chunk (-D for division, -C for chunk). For a workspace with 200 pages, that is pages 51..100. That way the user does not even need to know the total number of pages in a workspace. Again, the output should be in a format that the OCR-D processors accept for the page_id parameter.
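
The selection logic behind both suggestions can be sketched in plain Python, independent of the ocrd code base. This is only an illustration under stated assumptions: the helper names select_range and select_chunk are hypothetical, positions are taken to be 1-based and inclusive, the page list is assumed to be already in physical order, and the handling of page counts that do not divide evenly (earlier chunks get one extra page) is one possible choice, not something the proposal specifies.

```python
from typing import List


def select_range(page_ids: List[str], start: int, end: int) -> List[str]:
    """Return the pages at 1-based positions start..end (inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return page_ids[start - 1:end]


def select_chunk(page_ids: List[str], divisions: int, chunk: int) -> List[str]:
    """Split the list into `divisions` near-equal parts, return the 1-based `chunk`."""
    if divisions < 1 or not 1 <= chunk <= divisions:
        raise ValueError("expected 1 <= chunk <= divisions")
    size, extra = divmod(len(page_ids), divisions)
    # Assumption: the first `extra` chunks receive one additional page each.
    begin = (chunk - 1) * size + min(chunk - 1, extra)
    stop = begin + size + (1 if chunk <= extra else 0)
    return page_ids[begin:stop]


# Hypothetical workspace: 200 pages whose IDs are not sequential.
pages = [f"Phys{60047 + 2 * i}" for i in range(200)]

second_quarter = select_chunk(pages, divisions=4, chunk=2)
# Comma-separated list, the form the issue describes for the page_id parameter.
page_id_argument = ",".join(second_quarter)
```

With 200 pages, select_chunk(pages, 4, 2) and select_range(pages, 51, 100) return the same 50 page IDs, which matches the example in the second suggestion.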

Another benefit would be for the ocrd_network module. Currently, the workflow endpoint of the processing server accepts the page_id parameter to specify ranges. When the page ID format does not support a range (consider the 2nd workspace above), the user still has to provide the entire comma-separated list of physical pages.

Potential issue:
Since the ORDER field of the pages may not be unique in some cases, and the output of ocrd workspace list-page would probably be based on the input order, this could lead to consistency problems when several OCR-D processors run in parallel and the METS server changes the order of the pages when serializing.
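
A small Python sketch of this concern, using made-up data rather than a real METS file: when two pages share the same ORDER value and the selection sorts by ORDER alone, a stable sort keeps tied pages in their input order, so the result depends on how the pages happened to be serialized. The (ORDER, page ID) tuples and the helper first_n_by_order are hypothetical.

```python
# Hypothetical (ORDER, page_id) pairs; two pages share ORDER 2.
pages_a = [(1, "Phys60047"), (2, "Phys60048"), (2, "Phys60050"), (3, "Phys60051")]
# The same pages, serialized with the two tied entries swapped.
pages_b = [(1, "Phys60047"), (2, "Phys60050"), (2, "Phys60048"), (3, "Phys60051")]


def first_n_by_order(pages, n):
    """Sort by ORDER only (Python's sort is stable) and return the first n page IDs."""
    ranked = sorted(pages, key=lambda p: p[0])
    return [page_id for _, page_id in ranked[:n]]


selection_a = first_n_by_order(pages_a, 2)
selection_b = first_n_by_order(pages_b, 2)
```

Here selection_a ends with Phys60048 while selection_b ends with Phys60050, although both inputs describe the same set of pages.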

Fairly straightforward to implement, I'll prepare a PR.

> Since the ORDER field of the pages may not be unique in some cases, and the output of the ocrd workspace list-page would probably be based on the input order, this could lead to consistency problems when running several ocr-d processors in parallel and the mets server messes up the order when serializing.

We never add or change the @ORDER attribute unless the user explicitly does so on the command line. "Garbage in, garbage out" is what we try to adhere to with respect to METS.