kba / ocr-schemas

Convert and transform various OCR formats (hOCR, ALTO, PAGE, FineReader)

ocr-schemas

Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

Convert between Tesseract hOCR and ALTO XML 2.0/2.1 using XSL stylesheets

This project provides an installation path and command line interface for the stylesheets developed by @filak.

Installation

To install system-wide to /usr/local:

sudo make install

To install without sudo to your home directory:

make install PREFIX=$HOME/.local

If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"
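
If your startup file is sourced more than once (for example by nested shells), the directory can end up on PATH several times. A minimal sketch of a guard, assuming a POSIX-compatible shell; `path_prepend` is a hypothetical helper name and not part of this project:

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;      # PATH entries are separated by colons
  esac
}

path_prepend "$HOME/.local/bin"
export PATH
```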

The web application has a PHP backend. You can deploy it on any PHP-capable server by copying the web folder to a location below the server's document root, e.g. /var/www/html for Apache on Debian/Ubuntu:

sudo -u www-data cp -r web /var/www/html/ocr-schema

In this example the GUI would be available under http://localhost/ocr-schema/.

Usage

The project offers two functions, transformation and validation, which can be accessed via a command-line script (CLI), through a web interface (GUI), or from your own tools (API).

CLI

  • ocr-transform: Transformation of OCR output between OCR formats
  • ocr-validate: Validation of OCR output against OCR format schemas

API

Transformation

Transformation CLI

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

For example, you can transform an ALTO XML file to an hOCR file with:

ocr-transform alto hocr sample.xml sample.hocr

Or convert from hOCR to ALTO XML (version 2.1) with:

ocr-transform hocr alto2.1 sample.hocr sample.alto

You can also pass arguments directly to the Saxon CLI by passing them after a double dash (--). For example, to set the foo parameter to bar:

ocr-transform alto hocr sample.xml sample.hocr -- foo=bar
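
To convert many files, the command can be wrapped in a loop. The following is only a sketch: `convert_dir` and the `OCR_TRANSFORM` override variable are hypothetical names introduced here, and the `.alto.xml` output suffix is an arbitrary choice.

```shell
# Hypothetical helper: convert every *.hocr file in a directory to ALTO 2.1.
# Set OCR_TRANSFORM=echo for a dry run that only prints the arguments.
convert_dir() {
  dir=$1
  for f in "$dir"/*.hocr; do
    [ -e "$f" ] || continue   # glob matched nothing, skip the literal pattern
    "${OCR_TRANSFORM:-ocr-transform}" hocr alto2.1 "$f" "${f%.hocr}.alto.xml"
  done
}
```

Calling `convert_dir ./scans` would then run one transformation per hOCR file in `./scans` and write each result next to its input.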

Try ocr-transform -h to get an overview:

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
Input formats:
- 'alto'
- 'hocr'
Output formats:
- 'alto2.0'
- 'alto2.1'
- 'hocr'
Saxon-HE 9.7.0.4J from Saxonica
Java version 1.7.0_95
Usage: see http://www.saxonica.com/html/documentation/using-xsl/commandline.html
Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -l -license -m -nogo -now -o -opt -or -outval -p -pack -quit -r -repeat -s -sa -scmin -strip -t -T -threads -TJ -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -xsltversion -y
Use -XYZ:? for details of option XYZ
Params: 
param=value           Set stylesheet string parameter
+param=filename       Set stylesheet document parameter
?param=expression     Set stylesheet parameter using XPath
!param=value          Set serialization parameter

Transformation GUI

Select the Transform menu option. Choose a URL, an input and an output format. Click Transform.

Transformation API

The stylesheets are installed in $PREFIX/share/ocr-schemas/xslt and can be used directly in your scripts and software. You will need to use an XSLT 2.0 capable stylesheet transformer.
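
As a sketch of direct use, the following wraps Saxon's command line (`-s:`, `-xsl:` and `-o:` are standard Saxon options). The jar location in `SAXON_JAR` and the helper names are assumptions, and no stylesheet file names are assumed; list the directory to see which stylesheets your installation contains.

```shell
# Assumption: PREFIX is the prefix that was used for "make install".
PREFIX=${PREFIX:-/usr/local}
XSLT_DIR="$PREFIX/share/ocr-schemas/xslt"

# Assumption: SAXON_JAR points at a Saxon-HE jar (an XSLT 2.0 processor).
saxon() {
  java -jar "${SAXON_JAR:?set SAXON_JAR to the path of a Saxon jar}" "$@"
}

# Hypothetical helper: apply_stylesheet <stylesheet-name> <input> <output>
apply_stylesheet() {
  saxon "-s:$2" "-xsl:$XSLT_DIR/$1" "-o:$3"
}
```

For example, `ls "$XSLT_DIR"` shows the installed stylesheets, and `apply_stylesheet NAME.xsl in.hocr out.xml` runs one of them, where `NAME.xsl` stands for a file from that listing.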

Supported Transformations

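| From ╲ To  | hOCR | ALTO | PAGEXML | FineReader |
|------------|------|------|---------|------------|
| hOCR       | -    | ✔️   | ✖️      | ✖️         |
| ALTO       | ✔️   | ✖️   | ✖️      | ✖️         |
| PAGE       | ✖️   | ✖️   | -       | ✖️         |
| FineReader | ✖️   | ✖️   | ✖️      | -          |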

Validation

Validation CLI

Usage: ocr-validate [-dh] <schema> <file>

For example, to validate an XML file against the ALTO 3.1 schema:

ocr-validate alto-3-1 myFile.alto
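
A sketch for checking several files in one go. It assumes that `ocr-validate` exits with a non-zero status when validation fails, which is not stated in this README; `validate_all` and the `OCR_VALIDATE` override variable are hypothetical names.

```shell
# Hypothetical helper: validate_all <schema> <file>...
# Prints the files that failed and returns non-zero if there were any.
validate_all() {
  schema=$1
  shift
  failed=0
  for f in "$@"; do
    if ! "${OCR_VALIDATE:-ocr-validate}" "$schema" "$f" >/dev/null 2>&1; then
      echo "INVALID: $f"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

With this in place, `validate_all alto-3-1 *.alto` would report every file in the current directory that does not pass.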

Validation GUI

Select the Validate menu option. Choose a URL and a schema. Click Validate.

Validation API

The XSD files are installed under $PREFIX/share/ocr-schemas/xsd.
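
Any XSD-aware validator can use these files. A minimal sketch with `xmllint` from libxml2 (`--noout` and `--schema` are real xmllint options); the helper name is hypothetical and no schema file names are assumed, so check the directory for the actual names.

```shell
# Assumption: PREFIX is the prefix that was used for "make install".
PREFIX=${PREFIX:-/usr/local}
XSD_DIR="$PREFIX/share/ocr-schemas/xsd"

# Hypothetical helper: validate_with_xsd <xsd-file-name> <xml-file>
validate_with_xsd() {
  xmllint --noout --schema "$XSD_DIR/$1" "$2"
}
```

Here `validate_with_xsd NAME.xsd myFile.alto` would validate one file, where `NAME.xsd` stands for a schema file found in `$XSD_DIR`.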

Supported Validation Formats

hOCR ALTO PAGEXML FineReader
Validation ✖️

License

The XSL stylesheets for hOCR-ALTO and ALTO-hOCR transformation are licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Projects included during the installation process (in ./vendor):

License: MIT License

