madnight / pdf-layout-text-stripper

Converts a pdf file into a text file while keeping the layout of the original pdf.

PDFLayoutTextStripper as a Docker container command-line utility

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful for extracting the content of a table or a form from a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

  • Use cases
  • How to use

Use cases

Data extraction from a table in a PDF file

Data extraction from a form in a PDF file
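
To show why keeping the layout matters for tables and forms, here is a minimal, self-contained Java sketch of the general idea: each text fragment is placed on a character grid according to its page coordinates, so values that share an x position end up in the same text column. This is an illustration only, not the PDFLayoutTextStripper implementation. The `LayoutGridSketch` class, the `Fragment` type, and the fixed `charWidth` and `lineHeight` values are all assumptions made for this example; a real PDF needs per-font metrics, which Apache PDFBox provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; not the PDFLayoutTextStripper implementation. */
class LayoutGridSketch {

    /** Hypothetical type: a piece of text and the page coordinates where it starts. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Renders fragments on a character grid: y selects the row, x selects the column.
     * Assumes y grows downward and that every character is charWidth units wide.
     * Empty rows between lines are not reproduced in this simplified version.
     */
    static String render(List<Fragment> fragments, double charWidth, double lineHeight) {
        // Group fragments by row index; TreeMap keeps rows ordered top to bottom.
        Map<Integer, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            int row = (int) Math.round(f.y / lineHeight);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> row : rows.values()) {
            // Process each row left to right.
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                int column = (int) Math.round(f.x / charWidth);
                // Pad with spaces up to the target column so horizontal gaps survive.
                while (line.length() < column) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A tiny two-column "table": both rows start their second cell at x = 60.
        List<Fragment> page = Arrays.asList(
            new Fragment(0, 0, "Item"),
            new Fragment(60, 0, "Qty"),
            new Fragment(0, 12, "Apple"),
            new Fragment(60, 12, "3"));
        System.out.print(render(page, 6.0, 12.0));
    }
}
```

With the sample data in `main`, "Qty" and "3" both start at character column 10, so the table columns stay aligned in the plain-text output. A stripper that ignores positions would emit the same words without the padding that lines them up.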

How to use

# Build the image yourself
docker build -t pdf-layout-text-stripper .
docker run -v "$(pwd)":/app pdf-layout-text-stripper "sample.pdf"

# Or use the prebuilt image (the lazy way)
docker run -v "$(pwd)":/app madnight/pdf-layout-text-stripper "sample.pdf"

Languages

Java 98.9%, Dockerfile 1.1%