jrochkind / archive-hocr-tools

Efficient hOCR tooling

archive-hocr-tools

This repository contains a Python package that eases parsing hOCR in a streaming manner. The library is called hocr.
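To illustrate what streaming hOCR parsing means in general, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree.iterparse`. It does not use the hocr package and is not its API; the function name `iter_page_words` and the simplified sample document are assumptions made for illustration. The idea is to handle one `ocr_page` element at a time and release it afterwards, so a whole book never has to be held in memory at once.

```python
import io
import xml.etree.ElementTree as ET

# hOCR is XHTML, so element tags carry the XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def iter_page_words(source):
    """Yield (page_id, [word, ...]) for each ocr_page element, one page at a time.

    Hypothetical helper for illustration; not part of the hocr package.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == XHTML_NS + "div" and elem.get("class") == "ocr_page":
            words = [
                "".join(span.itertext()).strip()
                for span in elem.iter(XHTML_NS + "span")
                if span.get("class") == "ocrx_word"
            ]
            yield elem.get("id"), words
            # Drop the finished page's subtree so memory use stays bounded.
            elem.clear()


# Simplified two-page sample (real hOCR files carry more metadata).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="ocr_page" id="page_1">
<span class="ocrx_word" title="bbox 0 0 10 10">Hello</span>
<span class="ocrx_word" title="bbox 12 0 22 10">world</span>
</div>
<div class="ocr_page" id="page_2">
<span class="ocrx_word" title="bbox 0 0 10 10">Again</span>
</div>
</body>
</html>"""

pages = list(iter_page_words(io.BytesIO(SAMPLE)))
for page_id, words in pages:
    print(page_id, words)
```

Because the generator clears each page after yielding it, callers should finish with a page's data before asking for the next one.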

It also contains various tools:

  • hocr-combine-stream: A tool to combine many hOCR files into one large hOCR file. Used internally to combine Tesseract per-page results into a single hOCR file for an entire book.
  • hocr-fold-chars: A tool to transform a per-character hOCR file into a per-word hOCR file.
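The per-character to per-word transformation can be pictured with a small standard-library sketch. It is not the hocr-fold-chars implementation: the helper name `fold_word` is hypothetical, and the sketch assumes per-character output is represented as `ocrx_cinfo` spans nested inside an `ocrx_word` span. It simply replaces the character spans with their concatenated text and keeps the word's own attributes.

```python
import xml.etree.ElementTree as ET


def fold_word(word):
    """Collapse per-character child spans of a word element into plain text, in place.

    Hypothetical helper for illustration; not the tool's actual code.
    """
    text = "".join(word.itertext())
    # Remove every per-character child, then store the joined text on the word.
    for child in list(word):
        word.remove(child)
    word.text = text
    return word


# Assumed input shape: one ocrx_word span containing one ocrx_cinfo span per character.
word = ET.fromstring(
    '<span class="ocrx_word" title="bbox 0 0 30 10">'
    '<span class="ocrx_cinfo" title="x_bboxes 0 0 10 10">c</span>'
    '<span class="ocrx_cinfo" title="x_bboxes 10 0 20 10">a</span>'
    '<span class="ocrx_cinfo" title="x_bboxes 20 0 30 10">t</span>'
    "</span>"
)
folded = fold_word(word)
print(ET.tostring(folded, encoding="unicode"))
```

This sketch discards the per-character boxes entirely; whether and how the real tool preserves or aggregates that information is not stated here.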

About

License: Other

Language: Python 100.0%