OCR

built & tested using Python 3.11.2

Minimal version of a Python script that finds PDF files in a specified directory, converts them into .txt files using Python Tesseract & saves the resulting files in the same directory as the original PDF files.

Installation (on Windows 10)

clone repo
enter repo directory: cd ocr
install Tesseract OCR, the engine that Python Tesseract (pytesseract) wraps
create virtual environment: py -m venv venv
activate virtual environment: venv\Scripts\activate.bat
update pip: py -m pip install --upgrade pip
install requirements: pip install -r requirements.txt
run program as described below (Usage)
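
Before the first run, it can help to confirm that the external command-line tools are reachable. The sketch below is not part of ocr.py. It assumes that pytesseract calls the `tesseract` executable and that pdf2image uses its default Poppler backend (`pdftoppm`). The helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(names=("tesseract", "pdftoppm"), which=shutil.which):
    """Return the executables from `names` that are not found on PATH.

    Assumption: `tesseract` is needed by pytesseract and `pdftoppm`
    (part of Poppler) by pdf2image. Adjust the tuple if the setup differs.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("all external tools found")
```

An empty list means both tools were found. If Tesseract is installed but not on PATH, pytesseract also allows setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.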

Usage (on Windows 10)

run py ocr.py
follow instructions & prompts of program

Quickstart

run py ocr.py
press Enter to use sample PDFs in ./samplePDFs subdirectory by default

What happens

ocr.py creates .txt files with content of all PDFs in given directory

for each PDF in target directory (default: ./samplePDFs):
- pdf2image module converts each page of the PDF into an image
- for each image
  - pytesseract module converts the text in the image to a string, appends it to a .txt file named after the original PDF & saves that file alongside the original PDF
process takes up to 10 minutes
once the run finishes, content of ./samplePDFs is expected to match ./samplePDFsResult
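
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual ocr.py: the function names `pdf_to_text` and `ocr_directory` are hypothetical, and the text of all pages is joined and written once instead of being appended page by page. The `render` and `recognize` parameters are assumptions added so the flow can be tried without the OCR stack installed. By default they fall back to `pdf2image.convert_from_path` and `pytesseract.image_to_string`.

```python
from pathlib import Path


def pdf_to_text(pdf_path, render=None, recognize=None):
    """OCR one PDF page by page and return the concatenated text."""
    if render is None:
        # Imported lazily so this module loads without pdf2image installed.
        from pdf2image import convert_from_path
        render = convert_from_path
    if recognize is None:
        # Imported lazily so this module loads without pytesseract installed.
        import pytesseract
        recognize = pytesseract.image_to_string
    pages = render(str(pdf_path))  # one image per PDF page
    return "\n".join(recognize(page) for page in pages)


def ocr_directory(directory="./samplePDFs", render=None, recognize=None):
    """Write one .txt file next to every *.pdf in `directory`."""
    written = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")  # same name, same folder
        text = pdf_to_text(pdf_path, render, recognize)
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written


if __name__ == "__main__":
    for path in ocr_directory():
        print("wrote", path)
```

Passing the two conversion steps in as parameters keeps the directory walk separate from the OCR libraries, which makes it possible to exercise the file handling with stand-in functions.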

Resources

sample PDFs

Limitations / Known Issues

potentially inaccurate - depending on quality, structure & content of input PDFs (images, charts, ...)

Potential Improvements

adjust OCR settings to real world input PDFs (to achieve best results for expected input)
create REST API to share results with clients (long-term?)
- authentication & encryption (e.g. using JSON Web Token)
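
Adjusting the OCR settings could look like the sketch below. It is only an illustration of where such settings would go: the helper names `tesseract_config` and `recognize` are hypothetical, and which values work best depends on the real input PDFs. It relies on the `lang` and `config` arguments of `pytesseract.image_to_string` and on Tesseract's `--psm` (page segmentation mode, 0 to 13) and `--oem` (OCR engine mode, 0 to 3) options.

```python
def tesseract_config(psm=3, oem=3):
    """Build the option string passed to pytesseract via `config`."""
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("OCR engine mode must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def recognize(image, lang="eng", psm=3, oem=3):
    """Run Tesseract on one page image with explicit settings."""
    # Imported lazily so tesseract_config() can be used without pytesseract.
    import pytesseract
    return pytesseract.image_to_string(
        image, lang=lang, config=tesseract_config(psm, oem)
    )
```

For example, `recognize(page, lang="deu", psm=6)` would treat the page as a single uniform block of German text, provided the matching language data is installed for Tesseract.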
