PDFparser

Switch to Chinese

This is a demo of a PDF parser built on OCR and object detection tools. It covers PDF module recognition, extraction of multi-level headings, and more.

Requirements

I strongly recommend testing it on Linux.

pip install -r requirements
pip install "unstructured[pdf]"

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

# Install layoutparser and Detectron2, which provides the CV models
pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"

# layoutparser also supports a PaddleDetection backend
pip install "layoutparser[paddledetection]"

For installing unstructured, please refer to its installation guide. More details are in the layoutparser documentation.
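Before running the demo, it can help to confirm that the pieces installed above are actually visible to Python. The following is a minimal sketch, not part of this repository: `check_environment` and `REQUIRED_MODULES` are hypothetical names, and the module list is an assumption based on the install commands above.

```python
import importlib.util
import shutil

# Assumption: these import names correspond to the packages installed above.
REQUIRED_MODULES = ("unstructured", "layoutparser", "torchvision", "detectron2")


def check_environment(modules=REQUIRED_MODULES, binary="tesseract"):
    """Report which Python modules and which OCR binary can be found.

    Returns a dict mapping each name to True (found) or False (missing).
    Nothing is imported; only the import machinery and PATH are queried.
    """
    status = {
        name: importlib.util.find_spec(name) is not None for name in modules
    }
    # tesseract is a system binary (installed via apt), so look it up on PATH.
    status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:<14} {'found' if found else 'MISSING'}")
```

Any entry reported as missing points back to the matching install command in this section.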

How to use

# Extract multi-level headings
python multi_title.py

# Extract other elements
python parser.py

# Note: the test files used in multi_title.py can be generated with the tools in parser.py ('23.2307.14893.json' is the output of unstructured; 'test2_photo' comes from the pdf2image tool).
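To give a feel for what heading extraction from the unstructured JSON can look like, here is a minimal sketch. It is not the method used in multi_title.py (only part of the code is shared). It assumes the JSON is a list of element dicts with `type` and `text` keys, as unstructured typically serializes them, and it infers the level purely from a numeric prefix such as "3.1". The names `heading_level` and `extract_headings` are hypothetical.

```python
import re

# Matches a leading section number such as "2", "3.1" or "4.2.1",
# optionally followed by a dot, then whitespace and the heading text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def heading_level(text):
    """Return the depth implied by a numeric prefix, or None if there is none."""
    match = _NUMBERING.match(text)
    if match is None:
        return None
    # "3" -> level 1, "3.1" -> level 2, "3.1.2" -> level 3
    return match.group(1).count(".") + 1


def extract_headings(elements):
    """Collect (level, text) pairs from elements typed as "Title".

    Heuristic only: unnumbered titles (e.g. "Abstract") are skipped, and a
    title that merely starts with a number would be treated as a heading.
    """
    headings = []
    for element in elements:
        if element.get("type") != "Title":
            continue
        text = element.get("text", "").strip()
        level = heading_level(text)
        if level is not None:
            headings.append((level, text))
    return headings


# Illustrative input only; a real file would be loaded with json.load().
sample = [
    {"type": "Title", "text": "Abstract"},
    {"type": "Title", "text": "1 Introduction"},
    {"type": "NarrativeText", "text": "2.5 percent of pages were skipped."},
    {"type": "Title", "text": "2.1 Related Work"},
    {"type": "Title", "text": "2.1.3 Layout Models"},
]

for level, text in extract_headings(sample):
    print("  " * (level - 1) + text)
```

Documents without numbered sections would need another signal, such as the height of the title's bounding box on the page image, which is where the OCR and layout detection tools come in.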

Visualization of Extracted Multi-level Headings

Notes

A detailed blog post (in Chinese) is available. I apologize: due to project constraints, I can only share a portion of the code. However, feel free to ask any questions.

Reference

Apache License 2.0

Languages

Language:Python 100.0%