Documentation

Octopy

Command line tool for Kraken text segmentation and recognition. Made for OCR4all.

Setup

Note

Tested Python version: 3.11.x

Virtual environment (PyEnv)

pyenv install 3.11.9
pyenv virtualenv 3.11.9 octopy
pyenv activate octopy

Kraken

Clone modified Kraken version

git clone --single-branch --branch octopy https://github.com/jahtz/kraken

Install modified package

pip install ./kraken

Octopy

Clone repository

git clone --recurse-submodules https://github.com/jahtz/octopy

CUDA/cuDNN

Note

CUDA >=12.1 , v12.5 working.

For GPU computation:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda/lib:/opt/cuda/lib64

Usage

python octopy --help

Segtrain

> python octopy segtrain --help
Usage: octopy segtrain [OPTIONS] GROUND_TRUTH...

  Train a segmentation model using Kraken.

  This tool allows you to train a segmentation model from PageXML files, with
  options for using pre-trained models,  customizing evaluation, and setting
  advanced training configurations.

  Images should be stored in the same directory as their corresponding PageXML
  files.

  GROUND_TRUTH: List of PageXML files containing ground truth data for
  training.  Supports file paths, wildcards, or directories (used with the -g
  option).

Options:
  --help                         Show this message and exit.
  -o, --output DIRECTORY         Directory where training checkpoints and
                                 results will be saved. The directory will be
                                 created if it does not already exist.
                                 [required]
  -g, --gt-glob TEXT             Specify a glob pattern to match PageXML files
                                 within directories passed to GROUND_TRUTH.
                                 [default: *.xml]
  -e, --evaluation DIRECTORY     Optional directory containing files for model
                                 evaluation.
  -e, --eval-glob TEXT           Specify a glob pattern to match evaluation
                                 PageXML files in the --evaluation directory.
                                 [default: *.xml]
  -p, --partition FLOAT          If no evaluation directory is provided, this
                                 option splits the GROUND_TRUTH files into
                                 training and evaluation sets. Default split
                                 is 90% training, 10% evaluation.  [default:
                                 0.9]
  -m, --model FILE               Optional model to start from. If not set,
                                 training will start from scratch
  -n, --name TEXT                Name of the output model. Results in
                                 filenames such as <name>_best.mlmodel
                                 [default: foo]
  -d, --device TEXT              Select the device to use for training (e.g.,
                                 `cpu`, `cuda:0`). Refer to PyTorch
                                 documentation for supported devices.
                                 [default: cpu]
  --frequency FLOAT              Frequency at which to evaluate model on
                                 validation set. If frequency is greater than
                                 1, it must be an integer, i.e. running
                                 validation every n-th epoch.  [default: 1.0]
  -q, --quit [early|fixed]       Stop condition for training. Choose `early`
                                 for early stopping or `fixed` for a fixed
                                 number of epochs.  [default: fixed]
  -l, --lag INTEGER RANGE        For early stopping, the number of validation
                                 steps without improvement (measured by
                                 val_mean_iu) to wait before stopping.
                                 [default: 10; x>=1]
  --epochs INTEGER               Number of epochs to train when using fixed
                                 stopping.  [default: 50]
  --min-epochs INTEGER           Minimum number of epochs to train before
                                 early stopping is allowed.  [default: 0]
  -r, --resize [union|new|fail]  Controls how the model's output layer is
                                 resized if the training data contains
                                 different classes. `union` adds new classes,
                                 `new` resizes to match the training data, and
                                 `fail` aborts training if there is a
                                 mismatch. `new` is recommended.  [default:
                                 fail]
  --workers INTEGER RANGE        Number of worker processes for CPU-based
                                 training.  [default: 1; x>=1]
  --threads INTEGER RANGE        Number of threads to use for CPU-based
                                 training.  [default: 1; x>=1]
  --suppress-regions             Disable region segmentation training.
  --suppress-baselines           Disable baseline segmentation training.
  -vr, --valid-regions TEXT      Comma-separated list of valid regions to
                                 include in the training. This option is
                                 applied before region merging.
  -vb, --valid-baselines TEXT    Comma-separated list of valid baselines to
                                 include in the training. This option is
                                 applied before baseline merging.
  -mr, --merge-regions TEXT      Region merge mapping. One or more mappings of
                                 the form `$target:$src`, where $src is merged
                                 into $target.
  -mb, --merge-baselines TEXT    Baseline merge mapping. One or more mappings
                                 of the form `$target:$src`, where $src is
                                 merged into $target.
  -v, --verbose                  Set verbosity level for logging. Use -vv for
                                 maximum verbosity (levels 0-2).

segment

> python octopy segment --help
Usage: octopy segment [OPTIONS] [IMAGES]...

  Segment images using Kraken and save the results as XML files (PageXML
  format).

  This tool processes one or more images and segments them using a trained
  Kraken model.  The segmented results are saved as XML files, corresponding
  to each input image.

  IMAGES: List of image files to be segmented. Supports multiple file paths,
  wildcards,  or directories (when used with the -g option).

Options:
  --help                          Show this message and exit.
  -g, --glob TEXT                 Specify a glob pattern to match image files
                                  when processing directories in IMAGES.
                                  [default: *]
  -m, --model FILE                Path to a custom segmentation model. If not
                                  provided, the default Kraken model is used.
                                  Multiple models can be specified.
  -o, --output DIRECTORY          Directory to save the output PageXML files.
                                  Defaults to the parent directory of each
                                  input image.
  --text-direction [l2r|r2l|t2b|b2t]
                                  Set the text direction for segmentation.
                                  [default: l2r]
  -s, --suffix TEXT               Append a suffix to the output PageXML file
                                  names. For example, using `.seg.xml` results
                                  in filenames like `imagename.seg.xml`
                                  [default: .xml]
  -c, --creator TEXT              Specify the creator of the PageXML file.
                                  This can be useful for tracking the origin
                                  of segmented files.
  -d, --device TEXT               Specify the device for running the model
                                  (e.g., `cpu`, `cuda:0`). Refer to PyTorch
                                  documentation for supported devices.
                                  [default: cpu]
  --default-polygon <INTEGER INTEGER INTEGER INTEGER>...
                                  If the polygonizer fails to create a polygon
                                  around a baseline, use this option to create
                                  a default polygon instead of discarding the
                                  baseline. The offsets are defined as left,
                                  top, right, and bottom
  --sort-lines                    Sort text lines in each TextRegion based on
                                  their centroids according to the specified
                                  text direction. (Feature not yet
                                  implemented)
  --drop-empty                    Automatically drop empty TextRegions from
                                  the output. (Feature not yet implemented)

Licenses

Octopy:
This project is using Apache-2.0 Open Source license.
Kraken:
Octopy is using a modified version of Kraken v5.2.9.
Kraken is licensed under Apache-2.0 Open Source license.
Changes:
- blla.py, segmentation.py: Adding default_polygon option if polygonizer fails.

ZPD

Developed at Centre for Philology and Digitality (ZPD), University of Würzburg.

jahtz / octopy

Documentation

Octopy

Setup

Virtual environment (PyEnv)

Kraken

Octopy

CUDA/cuDNN

Usage

Segtrain

segment

Licenses

ZPD

About

Languages