pdf python webserver opencv pytesseract pytesseract-ocr socketio

Parse tables from PDF

This tool automates extracting tables from PDF documents. It goes through every page, detects where the tables are, and exports them to CSV. It uses pytesseract to read the text in each cell and determines the cell's position in the table.
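
As a rough illustration of the last step, here is a minimal sketch of how recognised cells could be arranged into rows and written to CSV. It is not the tool's actual code: the function names `cells_to_rows` and `rows_to_csv`, the `(x, y, text)` input format and the 10-pixel tolerance are all assumptions made for this example, and the OCR step itself is left out.

```python
import csv
import io


def cells_to_rows(cells, tolerance=10):
    """Arrange recognised cells into table rows.

    `cells` is a list of (x, y, text) tuples, where (x, y) is the top-left
    corner of a cell in pixels (hypothetical format). A cell joins the
    current row when its y coordinate is within `tolerance` pixels of the
    first cell of that row.
    """
    rows = []
    for x, y, text in sorted(cells, key=lambda c: c[1]):
        if rows and abs(rows[-1]["y"] - y) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Inside each row, order the cells from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_csv(rows):
    """Serialise the rows to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        (210, 52, "Price"),
        (10, 50, "Name"),
        (12, 101, "Apple"),
        (208, 99, "1.20"),
    ]
    print(rows_to_csv(cells_to_rows(sample)))
```

Slightly skewed scans rarely give identical y coordinates for cells in one row, which is why the sketch groups by tolerance instead of exact equality.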

You can use the tool either by running the Python script directly with command-line flags, or by running a web server that hosts a page for uploading files, processes them on the server, and returns the CSV files while displaying the current progress.

Here's the front page

While processing, it displays the status of each page and lets you download each result individually, or all of them together at the end.

Example results

Input table as an image in PDF file

Parsed table

Installation

Required Python libraries

pip install pytesseract opencv-python tqdm progressbar pdf2image pymupdf fitz frontend tools

# Optional for webserver
pip install aiohttp eventlet

Tesseract installation on Linux using apt

sudo apt install tesseract-ocr tesseract-ocr-rus

Linux using pacman

sudo pacman -S poppler
sudo pacman -S tesseract  # Select needed language, for example rus - 94
sudo pacman -S tesseract-data-rus tesseract-data-eng

Windows

Download the Tesseract installer (exe) from https://github.com/UB-Mannheim/tesseract/wiki.
Install it to C:\Program Files (x86)\Tesseract-OCR
Open a Windows command prompt or an Anaconda prompt.
Run pip install pytesseract
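
On Windows, pytesseract may not find the executable if the install directory is not on PATH. The sketch below shows one way to locate it, assuming the install path from the steps above. The helper name `find_tesseract` is made up for this example; `pytesseract.pytesseract.tesseract_cmd` is the pytesseract setting that accepts the resulting path.

```python
import os
import shutil

# Install location taken from the Windows steps above.
WINDOWS_TESSERACT = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=WINDOWS_TESSERACT):
    """Return the path to the tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Fall back to the fixed install location if it exists on this machine.
    return fallback if os.path.isfile(fallback) else None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "tesseract not found")
    # With pytesseract installed, the path can be handed over like this:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = path
```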

Running

Running locally

From PDF file

python3 recognise.py --client --input example/rencap2021.pdf --limit 10

From a remote PDF file

python3 recognise.py --client --remote https://github.com/pavtiger/Parse-tables-from-PDF/raw/master/example/rencap2021.pdf --limit 10

All output is written to the output/ directory. You can find example results in example/.

You can also change the render quality (200 or higher)

python3 recognise.py --client --input example/rencap2021.pdf --limit 10 --quality 300
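
To get a feel for why higher quality costs more memory, the sketch below estimates the uncompressed size of one rendered page. It assumes that the quality value is a DPI setting, that pages are A4, and that images are stored as 8-bit RGB; none of this is confirmed by the tool, and the figure covers only the raw page image, not the memory used by the processing steps.

```python
A4_INCHES = (8.27, 11.69)  # assumed page size; real PDFs may differ


def page_pixels(dpi, size_inches=A4_INCHES):
    """Pixel dimensions of a page rendered at the given DPI."""
    width = round(size_inches[0] * dpi)
    height = round(size_inches[1] * dpi)
    return width, height


def raw_page_bytes(dpi, channels=3, size_inches=A4_INCHES):
    """Uncompressed size of one rendered page with one byte per channel."""
    width, height = page_pixels(dpi, size_inches)
    return width * height * channels


if __name__ == "__main__":
    for dpi in (200, 300):
        w, h = page_pixels(dpi)
        megabytes = raw_page_bytes(dpi) / 1e6
        print(f"{dpi} DPI: {w}x{h} px, about {megabytes:.1f} MB uncompressed")
```

The pixel count grows with the square of the DPI, so going from 200 to 300 multiplies the per-page image size by about 2.25.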

Running web server

python3 recognise.py --server

All available flags:

--input - Path to the input PDF file to convert
--remote - URL from which to fetch the PDF file
--limit - Process only the first N pages (-1 for all)
--quality - PDF page render quality (default 200). Higher values consume more RAM, but going below 200 is strongly discouraged because it causes recognition errors. For reference, 300 requires 8 GB of RAM
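
For readers who want to script around these flags, here is a minimal argparse sketch that mirrors them. It is an illustration only, not the parser used in recognise.py: the flag names and the defaults of 200 and -1 come from this README, while the help texts, the types, and the choice to make --client and --server mutually exclusive are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags listed above (hypothetical)."""
    parser = argparse.ArgumentParser(description="Parse tables from PDF")
    # Assumption: exactly one of the two modes has to be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--client", action="store_true",
                      help="process a single PDF locally")
    mode.add_argument("--server", action="store_true",
                      help="start the web server")
    parser.add_argument("--input", help="path to the input PDF file")
    parser.add_argument("--remote", help="URL of a remote PDF file")
    parser.add_argument("--limit", type=int, default=-1,
                        help="process only the first N pages (-1 for all)")
    parser.add_argument("--quality", type=int, default=200,
                        help="PDF page render quality")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--client", "--input", "example/rencap2021.pdf", "--limit", "10"]
    )
    print(args)
```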

About

A tool that automates extracting data tables from PDF documents in which they appear as scans

https://pdf.pavtiger.com

