huridocs/pdf-table-of-contents-extractor

PDF Table of Contents Extraction

A Docker-powered service for extracting Table of Contents information from PDF documents

This project aims to extract Table of Contents (TOC) information from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of identifying and structuring the document's TOC.

The pdf-document-layout-analysis service is available here:

https://github.com/huridocs/pdf-document-layout-analysis

Quick Start

Start the service:

# With GPU support
make start

# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu

Get the table of contents from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070

To stop the server:

make stop

Dependencies

Docker Desktop 4.25.0

Requirements

4 GB RAM
6 GB GPU memory (if no GPU is available, the service runs on the CPU)

Usage

As mentioned in the Quick Start, you can call the service like this:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070

For faster processing at the cost of slightly less accurate results, use the /fast endpoint:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070/fast
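
The same two requests can be made from Python. The following is a minimal sketch using only the standard library, not part of this project. The helper names (`build_multipart`, `endpoint`, `extract_toc`) are hypothetical, and it assumes the service is running on localhost:5070 and returns its result as JSON.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes, boundary: str) -> bytes:
    """Encode a single file as a multipart/form-data body, like `curl -F`."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail


def endpoint(base_url: str = "http://localhost:5070", fast: bool = False) -> str:
    """Return the default endpoint, or the /fast endpoint when fast=True."""
    return base_url.rstrip("/") + ("/fast" if fast else "")


def extract_toc(pdf_path: str, fast: bool = False,
                base_url: str = "http://localhost:5070"):
    """POST a PDF to the running service and return the decoded response."""
    path = Path(pdf_path)
    boundary = uuid.uuid4().hex
    body = build_multipart("file", path.name, path.read_bytes(), boundary)
    request = urllib.request.Request(
        endpoint(base_url, fast),
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON.
        return json.loads(response.read().decode("utf-8"))
```

With the service started, `extract_toc("/PATH/TO/PDF/pdf_name.pdf")` corresponds to the first curl command and `extract_toc("/PATH/TO/PDF/pdf_name.pdf", fast=True)` to the second.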

For more information about models, check this link.

When processing finishes, the output is a list of TOCItem elements. Each TOCItem contains the following fields:

    {
        "indentation": Level of indentation
        "label": Content of the respective item
        "selectionRectangles": List of rectangles for the respective item
    }

Each item in selectionRectangles contains the following fields:

    {
        "left": Left position of the rectangle
        "top": Top position of the rectangle
        "width": Width of the rectangle
        "height": Height of the rectangle
        "page": Page number the rectangle belongs to
    }
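
To illustrate how these fields fit together, here is a small sketch that turns a list of TOCItem dictionaries into an indented text outline. The function name `render_outline` and the sample data are hypothetical. The sketch assumes that `indentation` is an integer-like level starting at 0 and that the first rectangle of an item marks the page where the entry appears.

```python
from typing import Any


def render_outline(toc_items: list[dict[str, Any]], indent: str = "  ") -> str:
    """Render TOC items as one line each, indented by their level."""
    lines = []
    for item in toc_items:
        rectangles = item.get("selectionRectangles", [])
        # Assumption: the first rectangle gives the page of the entry.
        page = rectangles[0]["page"] if rectangles else "?"
        level = int(item["indentation"])
        lines.append(f"{indent * level}{item['label']} (p. {page})")
    return "\n".join(lines)


# Hypothetical sample data shaped like the fields described above.
sample = [
    {
        "indentation": 0,
        "label": "Introduction",
        "selectionRectangles": [
            {"left": 72, "top": 90, "width": 200, "height": 14, "page": 1}
        ],
    },
    {
        "indentation": 1,
        "label": "Background",
        "selectionRectangles": [
            {"left": 90, "top": 120, "width": 180, "height": 12, "page": 2}
        ],
    },
]

print(render_outline(sample))
```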

To stop the server:

make stop

License

Apache License 2.0
