This repository is a guide to digitizing Arabic documents using Tesseract.
- Install the Python packages using pip:
  pip install -r requirements.txt
- Convert the PDF file into multiple PNGs:
  mkdir -p pngs && convert -density 150 -trim PDF_FILE.pdf pngs/page%d.png
  Note: Replace PDF_FILE.pdf with your file's name.
- Download the Arabic model to the tessdata/ directory from the tessdata repository: https://github.com/tesseract-ocr/tessdata/blob/main/ara.traineddata
- Run the OCR generation script:
  python run_ocr.py
- Inspect the output TSV file alongside the input PDF file to decide which ad-hoc text preprocessing operations are needed to clean the text.
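
As an illustration of the kind of ad-hoc cleaning the last step refers to, here is a minimal sketch. It assumes the output TSV uses Tesseract's standard columns (`page_num`, `block_num`, `par_num`, `line_num`, `conf`, `text`); the file produced by `run_ocr.py` may be laid out differently. The function names and the confidence threshold of 50 are hypothetical choices for this example, not part of the repository.

```python
import csv
import io
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) and the tatweel elongation mark (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def clean_word(word: str) -> str:
    """Strip diacritics, tatweel, and surrounding whitespace from one OCR token."""
    return _DIACRITICS.sub("", word).strip()


def read_clean_lines(tsv_file, min_conf: float = 50.0) -> list:
    """Rebuild text lines from a Tesseract-style TSV, skipping low-confidence words.

    `tsv_file` is any open text file object. The column names used here are
    an assumption about the TSV layout (see the note above).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = {}  # (page, block, paragraph, line) -> list of cleaned words
    for row in reader:
        text = clean_word(row.get("text") or "")
        # Structural rows carry empty text and conf == -1; both checks skip them.
        if not text or float(row["conf"]) < min_conf:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines.setdefault(key, []).append(text)
    # Dicts keep insertion order, so lines come back in file order.
    return [" ".join(words) for words in lines.values()]
```

It could be called as `read_clean_lines(open("output.tsv", encoding="utf-8"))`, where `output.tsv` stands in for whatever file name the OCR script writes. Whether stripping diacritics or dropping low-confidence words is appropriate depends on the document, which is why comparing the TSV with the original PDF comes first.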