πŸ“„ LLM-PDF-Parser

LLM-PDF-Parser is a FastAPI-based application that extracts text from PDFs and images and uses the NuExtract LLM, served via Ollama, to extract specific fields based on a given JSON template. πŸš€

✨ Features

  • πŸ“ Extract text from PDFs and images (JPG, PNG, JPEG) using PyMuPDF and EasyOCR.
  • πŸ€– Leverage AI to extract structured data based on a provided JSON template.
  • ⚑ FastAPI backend for quick and easy integration.
  • πŸ”₯ Supports OCR when text extraction from PDFs is insufficient.
  • πŸ”„ Cross-Origin Resource Sharing (CORS) enabled for flexible frontend integration.

πŸš€ Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/RiccardoTOTI/LLM-PDF-Extractor.git
cd LLM-PDF-Extractor

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Set Up Environment Variables

Create a .env file and configure the following variables (or set them in your environment):

OLLAMA_URL=http://localhost:11434
OLLAMA_MODEL=iodose/nuextract-v1.5
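
The application can then pick these settings up at startup; a minimal sketch using the standard os.environ mechanism (the defaults shown are illustrative, not taken from the repository):

import os

# Assumed to match the variable names above; defaults are only illustrative.
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "iodose/nuextract-v1.5")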

4️⃣ Run the Application

uvicorn main:app --host 0.0.0.0 --port 8000 --reload

5️⃣ Start the Ollama Server

ollama run iodose/nuextract-v1.5
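
To verify that the Ollama server is reachable at the URL configured in OLLAMA_URL and that the model has been pulled, you can list the installed models via Ollama's REST API (this endpoint belongs to Ollama itself, not to this project):

curl http://localhost:11434/api/tags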

🐳 Docker Support

You can also run the application using Docker.

πŸ— Build and Run with Docker

  1. Build the Docker image:
    docker build -t llm-pdf-parser .
  2. Run the container:
    docker run -p 8000:8000 llm-pdf-parser

πŸ”„ Using Docker Compose

You can use Docker Compose to spin up both the application and Ollama:

  1. Run the services:
    docker-compose up -d
  2. Download the model using the script in the tools directory inside the Ollama container:
    docker exec -it ollama ./tools/download_model.sh iodose/nuextract-v1.5
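
The repository's compose file wires the two services together; the sketch below shows what such a setup typically looks like and is illustrative only (the image tag, volume, and port mappings are assumptions, not copied from the repository):

version: "3.8"
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=iodose/nuextract-v1.5
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
volumes:
  ollama_data: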

πŸ”₯ API Usage

πŸ“₯ Upload a File & Extract Data

Endpoint: POST /extract

Request:

curl -X 'POST' \
  'http://localhost:8000/extract' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@yourfile.pdf' \
  -F 'fields={"Patient":{"First Name":"","Last Name":"","Tax Code":"","Doctor":[]}}'
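
The same request can also be made programmatically; a minimal Python sketch using the requests library (the file name and field template are placeholders):

import json
import requests

fields = {"Patient": {"First Name": "", "Last Name": "", "Tax Code": "", "Doctor": []}}

with open("yourfile.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/extract",
        files={"file": f},
        data={"fields": json.dumps(fields)},
    )

print(response.json())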

Response Example:

{
  "extracted_data": {
    "Patient": {
      "First Name": "John",
      "Last Name": "Doe",
      "Tax Code": "ABC123XYZ",
      "Doctor": [
        "Dr. Smith",
        "Dr. Bean"
      ]
    }
  }
}

Fields defined as lists in the template (such as Doctor above) are returned as lists of values.

πŸ— How It Works

  1. A PDF or image is uploaded via the FastAPI endpoint.
  2. Text is extracted with PyMuPDF, or with EasyOCR for scanned documents and images.
  3. The extracted text is sent to the NuExtract model served by Ollama, which structures it according to the provided JSON template.
  4. The structured data is returned as a JSON response. βœ…
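
A condensed sketch of that pipeline in Python (not the repository's actual code; the OCR fallback threshold and the prompt layout sent to the model are assumptions):

import fitz  # PyMuPDF
import easyocr
import requests

OLLAMA_URL = "http://localhost:11434"
OLLAMA_MODEL = "iodose/nuextract-v1.5"

def extract_text(path: str) -> str:
    # Try native PDF text first; fall back to OCR when too little text is found.
    doc = fitz.open(path)
    text = "".join(page.get_text() for page in doc)
    if len(text.strip()) < 50:  # assumed threshold for "insufficient" text
        reader = easyocr.Reader(["en"])
        text = " ".join(
            chunk
            for page in doc
            for chunk in reader.readtext(page.get_pixmap().tobytes("png"), detail=0)
        )
    return text

def extract_fields(text: str, template: str) -> str:
    # Send the text plus the JSON template to the model served by Ollama.
    prompt = f"### Template:\n{template}\n### Text:\n{text}\n"
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": OLLAMA_MODEL, "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]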

πŸ›  Technologies Used

  • Python 🐍
  • FastAPI ⚑
  • PyMuPDF πŸ“„
  • EasyOCR πŸ”
  • Ollama LLM πŸ€–
  • Uvicorn πŸš€

πŸ† Contributing

Contributions are welcome! Feel free to submit issues or open a pull request. 😊

πŸ“œ License

This project is licensed under the Apache 2.0 License.


πŸ’‘ Have suggestions or need help? Open an issue or reach out! πŸš€
