PaddlePulverizer
Introduction
- page layout analysis of a pdf document
- text reflow of the pdf document for reading on a kindle paperwhite 3
Dependencies
Python 3.7.x to 3.8.x (required by the paddle dependency)
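A quick way to check the interpreter against the range stated above is sketched below. The helper name `is_supported` and the two constants are made up for this illustration and are not part of the project.

```python
import sys

# Assumed bounds, taken from the "3.7.x ~ 3.8.x" range stated above.
MIN_VERSION = (3, 7)
MAX_VERSION = (3, 8)


def is_supported(version_info=sys.version_info):
    """Return True if major.minor lies within the stated range (hypothetical helper)."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    if not is_supported():
        print("Warning: Python %d.%d is outside the 3.7 to 3.8 range" % sys.version_info[:2])
```

Running the file with the interpreter you plan to use prints a warning only when the version falls outside the range.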
Python packages
- pdf processing
PyPDF4
pdf2image
pdf-annotate
- image processing
opencv_contrib_python==4.4.0.46
opencv-python-headless==4.1.2.30
Pillow
Paddle series
- It seems that these packages need to be installed individually.
PaddlePaddle
Layout-Parser
PaddleOCR
- others
tqdm
loguru
numpy
pytesseract
python-telegram-bot (optional, <= 13.15)
Other dependencies
poppler
- the dependency of the pdf2image package
tesseract
- OCR function for figure caption recognition
Optional functions
Installation
To install without the Telegram bot function:
py -m pip install -r requirements.txt
py -m pip install paddlepaddle==2.1.1 -i https://mirror.baidu.com/pypi/simple
py -m pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl
Usage
Command line
Help
See the options in detail:
python pulverizer.py -h
Page layout analysis
python pulverizer.py yourfile1.pdf [yourfile2.pdf ...] [-c 1] [-p 1 20]
When you run the code for the first time, it will take a while to download model data. After that, page layout analysis will start to work.
The PDF files in the example folder show the result.
Then you can edit the .md file based on the annotated PDF file (*_box.pdf or *_annotated.pdf).
Line template of the .md file:
1 x 61.87 697.18 104.68 712.64
pageNumber pageType left bottom right top
pageType
- x for text
- b for table
- f for figure
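The line template above can be read with a few lines of Python. This is only a sketch: the names `Box`, `PAGE_TYPES` and `parse_line` are hypothetical and do not come from pulverizer.py, and the mapping of x, b and f to text, table and figure follows the pageType list above.

```python
from dataclasses import dataclass

# Letter codes as read from the pageType list above.
PAGE_TYPES = {"x": "text", "b": "table", "f": "figure"}


@dataclass
class Box:
    page: int
    kind: str  # "text", "table" or "figure"
    left: float
    bottom: float
    right: float
    top: float


def parse_line(line):
    """Parse one 'pageNumber pageType left bottom right top' line."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    page, code = int(fields[0]), fields[1]
    if code not in PAGE_TYPES:
        raise ValueError(f"unknown pageType {code!r}")
    left, bottom, right, top = (float(v) for v in fields[2:])
    return Box(page, PAGE_TYPES[code], left, bottom, right, top)


box = parse_line("1 x 61.87 697.18 104.68 712.64")
print(box)
```

A parser like this rejects lines with a missing field or an unknown pageType, which can help catch mistakes made while editing the .md file by hand.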
Crop pdf(s) based on the .md file and reflow the text
python pulverizer.py yourfile.pdf [yourfile2.pdf ...] -md [-k 300]
The same pattern (arguments) is applied to every PDF file listed on the command line.
Telegram bot
Try this bot on Telegram (not available now).
But you can set one up yourself.
Settings
Windows
setx PULVERIZER_BOT_TOKEN "your bot token"
macOS
export ...
Linux
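On any of the platforms above, a bot script could read the token from the environment roughly as follows. The function name `get_bot_token` is hypothetical; only the variable name PULVERIZER_BOT_TOKEN comes from the settings above.

```python
import os


def get_bot_token(env=os.environ):
    """Return the token stored in PULVERIZER_BOT_TOKEN, or raise if it is missing."""
    token = env.get("PULVERIZER_BOT_TOKEN", "").strip()
    if not token:
        raise RuntimeError("PULVERIZER_BOT_TOKEN is not set")
    return token
```

Note that a variable set with setx is only visible in terminals opened afterwards, so the script has to be started from a new terminal.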
Functions
Basics of Telegram Bot
/start
/help
Core
/pl # page layout analysis
/pp # get the .md and box pdf file
/md # reflow
File manipulations
/gp # get current pdf file name
/sp # set current pdf file name
/ls # list current files in your folder
/xk # send the final reflowed pdf file
/rm # clear your folder
# send file with file name
/sn yourfilepath
# rename?
/rn
Problem
It is very difficult to package the source code with pyinstaller because of the complex structure of the paddle(ocr) package(s).
Issues
- the bottom of rectangle shapes should be lower
- pdf-annotate
  - rectangle shapes have some drift but the pdf cropping is correct
- multiprocess loses function (change to concurrent.futures)
  - 2023-11-23
  - 2023-11-23
- delete the last line of .md file
  - 2022-04-11
  - 2022-11-20
- opencv-python-headless==4.1.2.30 stackoverflow discussion
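For the multiprocess item in the list above, a move to concurrent.futures might look roughly like this. It is a sketch and not the project's actual code: `analyze_page` is a hypothetical stand-in for the per-page work, and `analyze_pages` and its parameters are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def analyze_page(page_number):
    """Hypothetical stand-in for the per-page work; returns a placeholder result."""
    return page_number, f"layout of page {page_number}"


def analyze_pages(page_numbers, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Run analyze_page over the pages in parallel, keeping the input order.

    executor_cls can be swapped for ThreadPoolExecutor, which has the same
    interface and is convenient for quick tests.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map() yields results in the order of the inputs.
        return list(pool.map(analyze_page, page_numbers))
```

When the process-based executor is used on Windows, the call to `analyze_pages` should sit under an `if __name__ == "__main__":` guard, because worker processes re-import the main module.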