internetarchive / analyze_ocr

Parse OCR result files for page numbers, tables of contents, etc.

Some code for analyzing OCR'd documents.  It's currently fairly
specific to Internet Archive OCR'd books, but it may be generalizable.

Entry point: analyze_ocr.py - run this against an Internet Archive scanned book.

Functionality: find headers/footers, page numbers, tables of contents.
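As a rough illustration of the page-number step, here is a minimal sketch. It is not taken from analyze_ocr.py: the function name guess_page_offset and the input shape (one list of header/footer text tokens per scanned leaf) are assumptions made for this example. The idea is that each numeric token votes for the offset between the leaf index and the printed page number, and the most common offset wins, so stray numbers from OCR noise are outvoted.

```python
import re
from collections import Counter

def guess_page_offset(candidates_per_leaf):
    """Guess the offset between leaf index and printed page number.

    candidates_per_leaf is a hypothetical input: one list of strings per
    scanned leaf, holding the tokens found in that leaf's header/footer area.
    Returns the offset such that printed_page == leaf_index - offset,
    or None if no numeric tokens were seen.
    """
    votes = Counter()
    for leaf_index, tokens in enumerate(candidates_per_leaf):
        # Use a set so a number repeated on one leaf only votes once.
        for token in set(tokens):
            if re.fullmatch(r"[0-9]+", token):
                votes[leaf_index - int(token)] += 1
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset

# Example: printed page 1 appears on leaf 2; "8" on leaf 4 is OCR noise.
leaves = [["Title"], [], ["1"], ["2", "CHAPTER"], ["8"], ["4"], ["5"]]
print(guess_page_offset(leaves))
```

A voting scheme like this only handles Arabic numerals in a single numbering run; roman-numeral front matter or restarted numbering would need separate handling.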


Languages

Python 99.3%, PHP 0.7%