PDF Extractor

This project allows you to extract images, text, metadata and text style from PDF files.

Installation and usage

Clone this repository and install the dependencies using pip:

git clone https://github.com/JonasChristiano/pdf-extractor
cd pdf-extractor
pip install -r requirements.txt

Run the script:

python3 pdf_extract.py path/to/file.pdf --extract [images|fonts|text|metadata|all] --pages 0 1 2 --output_folder output/folder

--extract: What to extract (images, fonts, text, metadata or all). Default: all.
--pages: The pages to extract information from (e.g. 0 1 2). Optional.
--output_folder: Path to the output folder. Optional.
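
The options above map naturally onto Python's standard argparse module. The following is a minimal sketch of how such a command line could be declared, not the actual contents of pdf_extract.py: the function name build_parser, the positional argument name pdf_path, and the None defaults for --pages and --output_folder are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="pdf_extract.py",
        description="Extract images, fonts, text and metadata from a PDF file.",
    )
    # Positional argument: the PDF to process (the name pdf_path is an assumption).
    parser.add_argument("pdf_path", help="Path to the PDF file.")
    # Restrict --extract to the documented values; "all" is the documented default.
    parser.add_argument(
        "--extract",
        choices=["images", "fonts", "text", "metadata", "all"],
        default="all",
        help="What to extract. Default: all.",
    )
    # Zero or more integer page numbers; None here stands for "option not given".
    parser.add_argument(
        "--pages",
        nargs="*",
        type=int,
        default=None,
        help="Pages to extract from (e.g. 0 1 2). Optional.",
    )
    # Optional output directory; None here stands for "option not given".
    parser.add_argument(
        "--output_folder",
        default=None,
        help="Path to the output folder. Optional.",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch runs standalone.
args = build_parser().parse_args(
    ["path/to/file.pdf", "--extract", "text", "--pages", "0", "1", "2"]
)
print(args.pdf_path, args.extract, args.pages, args.output_folder)
```

With a parser like this, an unsupported value such as --extract tables is rejected by argparse before any PDF processing starts.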

To display the help message:

python3 pdf_extract.py --help

If you want to contribute to PDF Extractor, please ensure that your commits follow the Conventional Commits pattern.
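
As an illustration of the pattern, a Conventional Commits header has the form type(optional scope): description, for example "feat(images): add PNG export". The sketch below checks a commit message header against that shape. It is not part of this repository: the name is_conventional and the list of accepted types are assumptions (the specification itself only defines feat and fix and permits other types).

```python
import re

# Commonly used types (an assumption); the specification only defines "feat" and "fix".
TYPES = ("build", "chore", "ci", "docs", "feat", "fix",
         "perf", "refactor", "revert", "style", "test")

# Header shape: type, optional (scope), optional "!", then ": " and a description.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional(message: str) -> bool:
    """Return True if the first line of the message matches the header shape."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    return HEADER_RE.match(header) is not None


print(is_conventional("feat(images): add PNG export"))
print(is_conventional("Updated stuff"))
```

A check like this could be run by hand or from a commit-msg hook before opening a pull request.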

License

This project is licensed under the MIT license. See the LICENSE file for details.
