πŸ“„ LLM-PDF-Parser

LLM-PDF-Parser is a FastAPI-based application that extracts text from PDFs and images and uses the NuExtract LLM, served via Ollama, to extract specific fields based on a given JSON template. πŸš€

✨ Features

  • πŸ“ Extract text from PDFs and images (JPG, PNG, JPEG) using PyMuPDF and EasyOCR.
  • πŸ€– Leverage AI to extract structured data based on a provided JSON template.
  • ⚑ FastAPI backend for quick and easy integration.
  • πŸ”₯ Supports OCR when text extraction from PDFs is insufficient.
  • πŸ”„ Cross-Origin Resource Sharing (CORS) enabled for flexible frontend integration.

πŸš€ Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/RiccardoTOTI/LLM-PDF-Extractor.git
cd LLM-PDF-Extractor

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Set Up Environment Variables

Create a .env file and configure the following variables (or set them in your environment):

OLLAMA_URL=http://localhost:11434
OLLAMA_MODEL=iodose/nuextract-v1.5
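
The application can then pick these settings up at startup; a minimal sketch using the standard os.environ mechanism (the defaults shown are illustrative, not taken from the repository):

import os

# Assumed to match the variable names above; defaults are only illustrative.
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "iodose/nuextract-v1.5")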

4️⃣ Run the Application

uvicorn main:app --host 0.0.0.0 --port 8000 --reload

5️⃣ Start the Ollama Server

ollama run iodose/nuextract-v1.5
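
To verify that the Ollama server is reachable at the URL configured in OLLAMA_URL and that the model has been pulled, you can list the installed models via Ollama's REST API (this endpoint belongs to Ollama itself, not to this project):

curl http://localhost:11434/api/tags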

🐳 Docker Support

You can also run the application using Docker.

πŸ— Build and Run with Docker

  1. Build the Docker image:
    docker build -t llm-pdf-parser .
  2. Run the container:
    docker run -p 8000:8000 llm-pdf-parser

πŸ”„ Using Docker Compose

You can use Docker Compose to spin up both the application and Ollama:

  1. Run the services:
    docker-compose up -d
  2. Download the model using the script in the tools directory inside the Ollama container:
    docker exec -it ollama ./tools/download_model.sh iodose/nuextract-v1.5
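
The repository's compose file wires the two services together; the sketch below shows what such a setup typically looks like and is illustrative only (the image tag, volume, and port mappings are assumptions, not copied from the repository):

version: "3.8"
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=iodose/nuextract-v1.5
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
volumes:
  ollama_data: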

πŸ”₯ API Usage

πŸ“₯ Upload a File & Extract Data

Endpoint: POST /extract

Request:

curl -X 'POST' \
  'http://localhost:8000/extract' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@yourfile.pdf' \
  -F 'fields={"Patient":{"First Name":"","Last Name":"","Tax Code":"","Doctor":[]}}'
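
The same request can also be made programmatically; a minimal Python sketch using the requests library (the file name and field template are placeholders):

import json
import requests

fields = {"Patient": {"First Name": "", "Last Name": "", "Tax Code": "", "Doctor": []}}

with open("yourfile.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/extract",
        files={"file": f},
        data={"fields": json.dumps(fields)},
    )

print(response.json())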

Response Example:

{
  "extracted_data": {
    "Patient": {
      "First Name": "John",
      "Last Name": "Doe",
      "Tax Code": "ABC123XYZ",
      "Doctor": [
        "Dr. Smith",
        "Dr. Bean"
      ]
    }
  }
}

Fields defined as lists in the template (such as Doctor above) are returned as lists of values.

πŸ— How It Works

  1. A PDF or image is uploaded via the FastAPI endpoint.
  2. Text is extracted with PyMuPDF, or with EasyOCR for scanned documents and images.
  3. The extracted text is sent to the NuExtract model served by Ollama, which structures it according to the provided JSON template.
  4. The structured data is returned as a JSON response. βœ…
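
A condensed sketch of that pipeline in Python (not the repository's actual code; the OCR fallback threshold and the prompt layout sent to the model are assumptions):

import fitz  # PyMuPDF
import easyocr
import requests

OLLAMA_URL = "http://localhost:11434"
OLLAMA_MODEL = "iodose/nuextract-v1.5"

def extract_text(path: str) -> str:
    # Try native PDF text first; fall back to OCR when too little text is found.
    doc = fitz.open(path)
    text = "".join(page.get_text() for page in doc)
    if len(text.strip()) < 50:  # assumed threshold for "insufficient" text
        reader = easyocr.Reader(["en"])
        text = " ".join(
            chunk
            for page in doc
            for chunk in reader.readtext(page.get_pixmap().tobytes("png"), detail=0)
        )
    return text

def extract_fields(text: str, template: str) -> str:
    # Send the text plus the JSON template to the model served by Ollama.
    prompt = f"### Template:\n{template}\n### Text:\n{text}\n"
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": OLLAMA_MODEL, "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]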

πŸ›  Technologies Used

  • Python 🐍
  • FastAPI ⚑
  • PyMuPDF πŸ“„
  • EasyOCR πŸ”
  • Ollama LLM πŸ€–
  • Uvicorn πŸš€

πŸ† Contributing

Contributions are welcome! Feel free to submit issues or open a pull request. 😊

πŸ“œ License

This project is licensed under the Apache 2.0 License.


πŸ’‘ Have suggestions or need help? Open an issue or reach out! πŸš€
