OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

METS Server based page parallelism for `ocrd process`

kba opened this issue

BTW, we could also provide this per-page parallelism recipe in core via Python. For the user, it could then look like:

ocrd process --jobs 4 --timeout 2m --on-error=empty

Originally posted by @bertsky in OCR-D/ocrd-demo-mets-server#3 (comment)

To elaborate:

  • add an option --jobs to ocrd process, which would split the workspace into per-page pipelines, synchronised via a METS server and managed by Python's built-in multiprocessing facilities.
    → could also offer additional options (e.g. splitting up into chunks instead of single pages)
  • add another option --timeout, applied to the lowest-level substep (i.e. normally a whole-workspace single-processor call, or a single-page single-processor call in the parallel case)
    → for now merely as a stopgap, later to be implemented in Processor.process_page and Processor.process_workspace once we have the new processor API
  • add another option --on-error offering several policies (raise, ignore, skip, empty)
    → for now merely as a stopgap, later to be implemented in Processor.process_page and Processor.process_workspace once we have the new processor API including error handling
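
To make the three options above more concrete, here is a minimal Python sketch of how a per-page driver might combine them. It is not existing core code: `parse_timeout`, `run_page`, `process_pages` and the `build_command` callback are hypothetical names, and the METS server synchronisation itself is left out. The sketch also swaps in a thread pool that drives one subprocess per page instead of using `multiprocessing` directly, because `subprocess.run` can enforce the timeout by killing the child. What `ignore`, `skip` and `empty` should do with the output is not specified above, so the sketch only reports a status per page and treats `raise` specially.

```python
import re
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Policy names taken from the proposal; their exact semantics are still open.
ON_ERROR_POLICIES = ("raise", "ignore", "skip", "empty")


def parse_timeout(text):
    """Convert a duration such as '90', '30s', '2m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smh]?)", text.strip())
    if match is None:
        raise ValueError(f"invalid timeout: {text!r}")
    value, unit = match.groups()
    return float(value) * {"": 1, "s": 1, "m": 60, "h": 3600}[unit]


def run_page(page_id, build_command, timeout):
    """Run the job for one page and return (page_id, status).

    status is 'ok', 'failed' (non-zero exit or command not runnable)
    or 'timeout'.
    """
    try:
        # subprocess.run kills the child process once the timeout expires.
        subprocess.run(build_command(page_id), check=True, timeout=timeout)
        return page_id, "ok"
    except subprocess.TimeoutExpired:
        return page_id, "timeout"
    except (subprocess.CalledProcessError, OSError):
        return page_id, "failed"


def process_pages(page_ids, build_command, jobs=1, timeout=None, on_error="raise"):
    """Run build_command(page_id) for every page, at most `jobs` at a time.

    Returns a dict mapping page ID to status. With on_error='raise', a
    RuntimeError is raised after all pages have finished if any page did
    not succeed; the other policies only report the status to the caller.
    """
    if on_error not in ON_ERROR_POLICIES:
        raise ValueError(f"unknown on-error policy: {on_error!r}")
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = dict(
            pool.map(lambda page_id: run_page(page_id, build_command, timeout), page_ids)
        )
    failures = {page_id: status for page_id, status in results.items() if status != "ok"}
    for page_id, status in failures.items():
        print(f"{page_id}: {status}", file=sys.stderr)
    if failures and on_error == "raise":
        raise RuntimeError(f"pages did not succeed: {sorted(failures)}")
    return results
```

A caller would pass a `build_command` that returns the argument list of a single processor call restricted to one page and pointed at the running METS server, plus `timeout=parse_timeout("2m")` and `jobs=4` to mirror the command line proposed above. The exact processor flags are deliberately not spelled out here.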