This project lets you extract images, text, metadata, and text style (font and size) from PDF files.
- Extract images from PDF files.
- Extract font style and size.
- Extract text.
- Extract metadata.
- Extract all information.
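As an illustration of the font style and size feature, here is a minimal sketch of how such information can be collected with PyMuPDF. It is not necessarily how pdf_extract.py does it: the function names `collect_font_styles` and `font_styles_for_pdf` are hypothetical, and the sketch relies only on the dictionary returned by PyMuPDF's `page.get_text("dict")`, where text blocks contain lines, and lines contain spans with `font`, `size`, and `text` keys.

```python
from collections import Counter


def collect_font_styles(page_dict: dict) -> Counter:
    """Count characters per (font name, size) pair in one page's text dict.

    `page_dict` is expected to have the shape of page.get_text("dict").
    Hypothetical helper, not taken from pdf_extract.py.
    """
    styles = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                key = (span["font"], round(span["size"], 1))
                styles[key] += len(span["text"])
    return styles


def font_styles_for_pdf(path: str) -> Counter:
    """Aggregate font usage over every page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so the helper above needs no dependency

    totals = Counter()
    with fitz.open(path) as doc:
        for page in doc:
            totals += collect_font_styles(page.get_text("dict"))
    return totals
```

Calling `font_styles_for_pdf("path/to/file.pdf").most_common(3)` would then list the three most used font and size combinations. Sizes are rounded to one decimal place so that near-identical values are grouped together, which is a design choice of this sketch.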
- Python 3.x
- PyMuPDF (pip install PyMuPDF)
Clone this repository and install the dependencies using pip:
git clone https://github.com/JonasChristiano/pdf-extractor
cd pdf-extractor
pip install -r requirements.txt
Run the script:
python3 pdf_extract.py path/to/file.pdf --extract [images|fonts|text|metadata|all] --pages 0 1 2 --output_folder output/folder
- --extract: What to extract (images, fonts, text, metadata, or all). Defaults to all.
- --pages: Pages to extract information from (e.g., 0 1 2). Optional.
- --output_folder: Path to the output folder. Optional.
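For readers who want to see how the options above fit together, the command line can be modeled with `argparse` roughly as follows. This is a sketch under assumptions, not the parser used in pdf_extract.py: the function name `build_parser`, the positional name `pdf_path`, and the `None` defaults for the optional arguments are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="pdf_extract.py",
        description="Extract images, fonts, text, or metadata from a PDF file.",
    )
    parser.add_argument("pdf_path", help="Path to the input PDF file.")
    parser.add_argument(
        "--extract",
        choices=["images", "fonts", "text", "metadata", "all"],
        default="all",
        help="What to extract (default: all).",
    )
    parser.add_argument(
        "--pages",
        type=int,
        nargs="+",      # accepts one or more integers, e.g. --pages 0 1 2
        default=None,   # assumption: None stands for "no page filter"
        help="Pages to extract information from.",
    )
    parser.add_argument(
        "--output_folder",
        default=None,   # assumption: the real default is not documented here
        help="Path to the output folder.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["report.pdf", "--extract", "text", "--pages", "0", "1"]
)
print(args.extract, args.pages)  # prints: text [0, 1]
```

Because `--extract` is declared with `choices`, an unsupported value such as `--extract tables` is rejected by `argparse` with a usage error before any PDF is opened.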
To display the help message:
python3 pdf_extract.py --help
If you want to contribute to PDF Extractor, follow these steps:
- Fork this repository.
- Create a branch for your feature (git checkout -b my-feature).
- Commit your changes (git commit -m "Add my feature").
- Push to the branch (git push origin my-feature).
- Open a Pull Request.
Please follow the Conventional Commits specification when writing your commit messages.
License
This project is licensed under the MIT License. See the LICENSE file for details.