This is a demo of a PDF parser, including OCR and object detection tools. It covers PDF layout module recognition, extraction of multi-level headings, and more.
First, I strongly recommend testing it on Linux.
pip install -r requirements.txt
pip install "unstructured[pdf]"
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
# Install layoutparser and Detectron2, which provides the CV models
pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
# layoutparser also supports PaddleDetection as a backend
pip install "layoutparser[paddledetection]"
For installing unstructured, please refer to its official installation guide. More details are available in the layoutparser documentation.
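After installing, it can help to confirm that the dependencies are actually visible from Python. The following is a minimal sketch using only the standard library. The package and binary names are assumptions taken from the install commands above, and `check_environment` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import shutil

# Assumed names, taken from the install commands above.
PYTHON_PACKAGES = ["unstructured", "layoutparser", "torchvision", "detectron2"]
SYSTEM_BINARIES = ["tesseract"]


def check_environment(packages=PYTHON_PACKAGES, binaries=SYSTEM_BINARIES):
    """Return a dict mapping each dependency name to True (found) or False."""
    status = {}
    for name in packages:
        # find_spec locates a top-level module without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which searches PATH for an executable with this name.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{'OK' if found else 'MISSING':8}{name}")
```

Anything reported as MISSING probably needs the corresponding install step repeated. This only checks that a module or binary can be found, not that it works correctly.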
# Extraction of Multi-level Headings
python multi_title.py
# Extraction of other elements
python parser.py
# Note: the test files used in multi_title.py can be generated with some of the tools in parser.py ('23.2307.14893.json' is the output of unstructured; 'test2_photo' comes from the pdf2image tool).
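To illustrate how heading levels might be derived from an unstructured JSON result such as '23.2307.14893.json', here is a rough sketch. It assumes the file is a list of element dictionaries with "type" and "text" keys, which is how unstructured serializes elements. The numbering heuristic and the function names are hypothetical; this is not the logic used in multi_title.py.

```python
import json
import re

# Leading section numbers such as "1", "2.3" or "4.1.2", optionally followed by a dot.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def heading_level(text):
    """Guess a level from the numbering: "2" -> 1, "2.3" -> 2; unnumbered -> 1."""
    match = NUMBERING.match(text.strip())
    if not match:
        return 1
    return match.group(1).count(".") + 1


def extract_headings(elements):
    """Return (level, text) pairs for every element whose type is "Title"."""
    headings = []
    for element in elements:
        if element.get("type") == "Title":
            text = element.get("text", "").strip()
            if text:
                headings.append((heading_level(text), text))
    return headings


def load_headings(json_path):
    """Read an unstructured JSON file and extract its headings."""
    with open(json_path, encoding="utf-8") as handle:
        return extract_headings(json.load(handle))


if __name__ == "__main__":
    # Made-up sample elements, for illustration only.
    sample = [
        {"type": "Title", "text": "1 Introduction"},
        {"type": "NarrativeText", "text": "Some body text."},
        {"type": "Title", "text": "1.1 Background"},
    ]
    for level, text in extract_headings(sample):
        print("  " * (level - 1) + text)
```

A numbering-based heuristic like this fails on documents with unnumbered subsections. In that case, font size or layout information from the detection step would probably be a better signal.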
A detailed blog post (in Chinese) is also available. I apologize: due to project constraints, I can only share part of the code. However, feel free to ask any questions.