handwritten-character-recognition handwritten-text-recognition htr ocr optical-character-recognition shadow-removal tesseract tesseract-ocr text-digitisation

Enhancing Tesseract OCR

This is the repository for our Signal, Image and Video course project (Giovanni Valer and Laurence Bonat).

The Report is available here.

Installation

We used Python 3.12.0 and Tesseract-OCR 5.3.3. See requirements.txt for the required packages.

Methods

The methods folder contains the experiments of our project, one per file:

manual_trackbar.py: trackbar in manual mode
autonomous_trackbar.ipynb: trackbar in autonomous mode
automatic_filtering.ipynb: automatic filtering of the text through a specific pipeline
lines_detection.ipynb: automatically detect if a text is on lined/squared paper
squared_paper_ocr.ipynb: HTR on lined/squared paper
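
As a rough illustration of what detecting lined paper can involve, the sketch below uses a horizontal projection profile: a row counts as a ruled line when most of its pixels are dark. This is not the code in lines_detection.ipynb. The function names, the thresholds, and the plain list-of-lists image format are all assumptions chosen to keep the example dependency-free.

```python
def detect_ruled_rows(gray, dark_threshold=128, min_fraction=0.8):
    """Return indices of rows that look like horizontal ruled lines.

    gray is a list of rows, each a list of 0-255 intensities (hypothetical
    input format). A row qualifies when at least min_fraction of its pixels
    are darker than dark_threshold. Both thresholds are illustrative.
    """
    ruled = []
    for y, row in enumerate(gray):
        if not row:
            continue
        dark = sum(1 for px in row if px < dark_threshold)
        if dark / len(row) >= min_fraction:
            ruled.append(y)
    return ruled


def count_lines(rows):
    """Merge runs of adjacent row indices so a thick line counts once."""
    lines = 0
    previous = None
    for y in rows:
        if previous is None or y != previous + 1:
            lines += 1
        previous = y
    return lines


def is_lined_paper(gray, min_lines=2, **kwargs):
    """Guess that a page is lined when it has at least min_lines ruled lines."""
    return count_lines(detect_ruled_rows(gray, **kwargs)) >= min_lines


# Tiny synthetic page: white background with two dark ruled lines.
page = [[255] * 10 for _ in range(10)]
page[2] = [0] * 10
page[7] = [0] * 10
print(detect_ruled_rows(page), is_lined_paper(page))
```

Running the same check on the transposed image would find vertical lines, which is one way to tell squared paper from lined paper. Handwriting rarely fills 80% of a row, which is why a high min_fraction separates ruling from text in this sketch.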

Results

The results folder contains the outputs of all methods. The compute_metrics.py script computes the average accuracy of each method and saves it to results/results.txt, plus some other metrics in results/metrics.
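
This README does not say how accuracy is defined. One common choice for OCR is character-level accuracy based on edit distance, sketched below. It is an assumption for illustration and not necessarily what compute_metrics.py does. The function names and the sample pairs are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def average_accuracy(pairs):
    """Mean character accuracy over (reference, hypothesis) pairs."""
    scores = [char_accuracy(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ground truth and OCR output for two images.
samples = [("hello world", "hello world"), ("tesseract", "tesseraot")]
print(round(average_accuracy(samples), 3))
```

Averaging per-image scores weights every image equally regardless of text length. Dividing the total edit distance by the total reference length is the alternative when longer texts should count for more.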

About

Preprocessing methods to enhance Tesseract-OCR for printed text on difficult backgrounds and for handwritten text on lined/squared paper.

MIT License

Languages

Jupyter Notebook 88.0%, Python 12.0%