dokosho02 / paddlePulverizer

page layout analysis, ready to use, a wrapper for PaddleOCR

PaddlePulverizer

Introduction

  1. Page layout analysis of a PDF document
  2. Text reflow of the PDF document for reading on a Kindle Paperwhite 3

Dependencies

Python 3.7.x to 3.8.x, as required by the Paddle dependencies

Python packages

  • pdf processing
    • PyPDF4
    • pdf2image
    • pdf-annotate
  • image processing
    • opencv_contrib_python==4.4.0.46
    • opencv-python-headless==4.1.2.30
    • Pillow
  • Paddle series - It seems that these packages need to be installed individually.
    • PaddlePaddle
    • Layout-Parser
    • PaddleOCR
  • others

Other dependencies

Optional functions

  • k2pdfopt - reflow of pdf text file
    • Ubuntu - sudo apt-get install k2pdfopt -y

Installation

To install without the Telegram bot function:

py -m pip install -r requirements.txt
py -m pip install paddlepaddle==2.1.1 -i https://mirror.baidu.com/pypi/simple
py -m pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl

Usage

Command line

Help

See the options in detail:

python pulverizer.py -h

Page layout analysis

python pulverizer.py yourfile1.pdf [yourfile2.pdf ...] [-c 1] [-p 1 20]

The first run takes a while because the model data has to be downloaded. Page layout analysis starts once the download finishes.

The PDF files in the example folder show the result.

You can then edit the .md file based on the annotated PDF file (*_box.pdf or *_annotated.pdf).

line template of the .md file

1	x	61.87	697.18	104.68	712.64
pageNumber pageType left bottom right top
  • pageType
    • x for text
    • b for table
    • f for figure
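The line template above can be read back with a few lines of Python. This is a minimal sketch, not the project's own parser: the names `Region`, `parse_region_line`, and `parse_regions` are hypothetical, and it assumes every non-blank line holds exactly six whitespace-separated fields in the order shown in the template.

```python
from typing import List, NamedTuple

# Region type codes taken from the template description above.
REGION_TYPES = {"x": "text", "b": "table", "f": "figure"}


class Region(NamedTuple):
    """One line of the .md file (hypothetical container, not part of the project)."""
    page: int
    kind: str
    left: float
    bottom: float
    right: float
    top: float


def parse_region_line(line: str) -> Region:
    """Parse 'pageNumber pageType left bottom right top' into a Region."""
    fields = line.split()  # splits on tabs and spaces alike
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(fields), line))
    page, kind = int(fields[0]), fields[1]
    if kind not in REGION_TYPES:
        raise ValueError("unknown pageType %r" % kind)
    left, bottom, right, top = (float(value) for value in fields[2:])
    return Region(page, kind, left, bottom, right, top)


def parse_regions(text: str) -> List[Region]:
    """Parse every non-blank line of the .md file content."""
    return [parse_region_line(line) for line in text.splitlines() if line.strip()]
```

For example, `parse_region_line("1\tx\t61.87\t697.18\t104.68\t712.64")` yields a text region on page 1. A check like this can catch a malformed line after hand-editing, before the file is passed back to pulverizer.py.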

Crop pdf(s) based on .md file and reflow the text

python pulverizer.py yourfile.pdf [yourfile2.pdf ...] -md [-k 300]

The same arguments are applied to every PDF file listed.

Telegram bot

Try this bot on Telegram (not available now)

You can, however, set one up yourself.

Settings

windows
setx PULVERIZER_BOT_TOKEN "your bot token"
macOS
export ...
Linux
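Whichever platform is used, the bot has to read the token back from the environment at startup. The following is a minimal sketch of that step, assuming the token is stored under `PULVERIZER_BOT_TOKEN` as in the Windows line above; the function name `load_bot_token` is hypothetical and is not taken from the project's source.

```python
import os


def load_bot_token(env=None, name="PULVERIZER_BOT_TOKEN"):
    """Return the bot token from the environment, or fail with a clear message.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            "Environment variable %s is not set; see the Settings section." % name
        )
    return token
```

On Windows, a value stored with `setx` only becomes visible in terminals opened afterwards, so a check like this fails in the window where `setx` was just run.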

Functions

Basics of Telegram Bot
/start
/help
Core
/pl    # page layout analysis
/pp    # get the .md and box pdf file
/md    # reflow
file manipulations
/gp    # get current pdf file name
/sp    # set current pdf file name
/ls    # list current files in your folder
/xk    # send the final reflowed pdf file
/rm    # clear your folder
# send file with file name
/sn yourfilepath  

# rename?
/rn

Problem

Packaging the source code with PyInstaller is very difficult because of the complex structure of the Paddle and PaddleOCR packages.

Issues

  • the bottom of rectangle shapes should be lower
  • pdf-annotate - rectangle shapes have some drift but the pdf cropping is correct
  • multiprocess loses function (change to concurrent.futures) - 2023-11-23 - 2023-11-23
  • delete the last line of .md file - 2022-04-11 - 2022-11-20
  • opencv-python-headless==4.1.2.30 stackoverflow discussion
