Suggestion: support split page ranges
MehmedGIT opened this issue
The OCR-D processors accept a `page_id` parameter, which makes it possible to run a processor on specific pages or on a range of pages. However, some workspaces are problematic when it comes to the consistency of the page IDs. Consider the following page IDs in 2 different workspaces:
- workspace: `PHYS_0001, PHYS_0002, ..., PHYS_XXXX` - an ideal case, where the page IDs are consistent and sequential, and where simply doing `PHYS_YYYY..PHYS_XXXX` is enough to split the ranges.
- workspace: `Phys60047, Phys60048, Phys60050` - a somewhat troublesome case, where the IDs are not sequential although the pages are.
The physical page ID format differs between workspaces, so creating a pattern that works in general is hard, and every implementer needs to solve this separately.
Of course, the user could still manually get the list of physical pages (with `ocrd workspace list-page`) and try to divide the ranges themselves. However, it would be more convenient if this were supported by the CLI tool.
Suggestions:
- `ocrd workspace list-page -R 51..100` should return only the physical pages ranging from 51 to 100 regardless of their `page_id` format (-R for range). The output should be in a format that the OCR-D processors accept with the `page_id` parameter.
- `ocrd workspace list-page -D 4 -C 2` should divide the workspace into 4 equal parts and return the 2nd chunk (-D for division, -C for chunk), in our case, with 200 pages, pages 51..100. That way the user does not even need to know the total number of pages in a workspace. Again, the output is in a format acceptable to the `page_id` parameter of the OCR-D processors.
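The two suggestions above can be sketched in a few lines of Python. This is only an illustration of the intended selection logic, not the OCR-D implementation: the function names `select_range`, `select_chunk` and `to_page_id_arg` are hypothetical, positions are assumed to be 1-based and inclusive, and the input is assumed to be the list of page IDs in the order printed by `ocrd workspace list-page`.

```python
from typing import List


def select_range(page_ids: List[str], start: int, end: int) -> List[str]:
    """Return the pages at 1-based positions start..end (inclusive),
    regardless of what the page IDs look like."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return page_ids[start - 1:end]


def select_chunk(page_ids: List[str], divisions: int, chunk: int) -> List[str]:
    """Split the pages into `divisions` near-equal parts and return the
    1-based part number `chunk`."""
    if divisions < 1 or not 1 <= chunk <= divisions:
        raise ValueError("expected 1 <= chunk <= divisions")
    base, extra = divmod(len(page_ids), divisions)
    # Assumption: when the page count is not divisible, the first `extra`
    # chunks each receive one additional page.
    start = (chunk - 1) * base + min(chunk - 1, extra)
    size = base + (1 if chunk <= extra else 0)
    return page_ids[start:start + size]


def to_page_id_arg(page_ids: List[str]) -> str:
    """Join page IDs into a comma-separated list, as used for page_id."""
    return ",".join(page_ids)


if __name__ == "__main__":
    # Non-sequential IDs, as in the second workspace above
    pages = ["Phys60047", "Phys60048", "Phys60050", "Phys60051"]
    print(to_page_id_arg(select_range(pages, 2, 3)))
    print(to_page_id_arg(select_chunk(pages, 2, 2)))
```

Selecting by position rather than by parsing the ID is what makes this independent of the `page_id` format; the price is that the result depends entirely on the order of the input list.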
Another benefit would be for the `ocrd_network` module. Currently, the workflow endpoint of the processing server accepts the `page_id` parameter to specify ranges. The user is still obliged to provide an entire list of comma-separated physical pages when their format does not support a range (consider the 2nd workspace above).
Potential issue:
Since the `ORDER` field of the pages may not be unique in some cases, and the output of `ocrd workspace list-page` would probably be based on the input order, this could lead to consistency problems when running several OCR-D processors in parallel and the METS server messes up the order when serializing.
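To make the concern concrete, here is a small Python sketch. It assumes a hypothetical in-memory representation of pages as `(page ID, ORDER)` tuples and does not reflect how OCR-D or the METS server actually store or serialize pages. It shows that when `ORDER` values repeat, an ordering that falls back on input order changes as soon as the input order changes, whereas adding the page ID as a tie-breaker (one possible mitigation, not something proposed in this issue) stays stable.

```python
from typing import List, Tuple

# Hypothetical representation: (page ID, ORDER value)
Page = Tuple[str, int]


def by_order_then_input(pages: List[Page]) -> List[str]:
    """Sort by ORDER only. Python's sort is stable, so pages with equal
    ORDER keep whatever relative order they had in the input."""
    return [pid for pid, _ in sorted(pages, key=lambda p: p[1])]


def by_order_then_id(pages: List[Page]) -> List[str]:
    """Sort by ORDER, breaking ties by page ID, so the result does not
    depend on the input order."""
    return [pid for pid, _ in sorted(pages, key=lambda p: (p[1], p[0]))]


if __name__ == "__main__":
    # Same pages, serialized in two different orders; ORDER 2 is duplicated
    first = [("P1", 1), ("P2", 2), ("P3", 2)]
    second = [("P1", 1), ("P3", 2), ("P2", 2)]
    print(by_order_then_input(first), by_order_then_input(second))
    print(by_order_then_id(first), by_order_then_id(second))
```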
Fairly straightforward to implement, I'll prepare a PR.
> Since the `ORDER` field of the pages may not be unique in some cases, and the output of `ocrd workspace list-page` would probably be based on the input order, this could lead to consistency problems when running several OCR-D processors in parallel and the METS server messes up the order when serializing.
We never add or change the `@ORDER` attribute unless the user explicitly does so on the command line. Garbage in, garbage out is what we try to adhere to with respect to METS.