jakobjanot / pacer-docket-parser

PACER Docket Parser - Parsing PACER Docket PDF files with ease

This project aims to extract structured information from PACER docket PDF files and store it in the JSON formats specified here.

Installation

  1. Install the docketparser library via:

    git clone https://github.com/allenai/pacer-docket-parser.git
    cd ./pacer-docket-parser 
    pip install -e .
    pip install 'git+https://github.com/facebookresearch/detectron2.git#egg=detectron2' 

    You might find more instructions on Detectron2 installation here.

  2. We use Poppler to render PDF documents as images. The installation method depends on your platform:

    1. Mac: brew install poppler
    2. Ubuntu: sudo apt-get install -y poppler-utils
    3. Windows: See this post
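
To confirm that Poppler is reachable after installation, a quick check like the one below may help. It is only a sketch: it assumes the Poppler command-line tools `pdftoppm` and `pdfinfo` should be on your `PATH`, and the helper names are hypothetical and not part of docketparser.

```python
import shutil

# Poppler command-line tools; pdftoppm renders PDF pages to images.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the tool names that could not be found on PATH."""
    found = find_poppler_tools(tools)
    return [name for name, path in found.items() if path is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks installed.")
```

If a tool is reported missing on Windows, the usual cause is that the Poppler `bin` directory has not been added to `PATH`.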

Usage

Docket Table detection and extraction for PDF

Use the following command to extract docket tables from a PDF file:

docketparser parse-all [PDF_FILES] [SAVE_PATH]

It saves the extracted table and metadata for each PDF_FILE as filename.csv and filename.json. Please check the exemplar outputs here. For each PDF, we generate a single CSV file, which merges the tables from all of its pages.
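
Once the command has run, the two output files can be read back with the Python standard library. The sketch below makes two assumptions: that `filename` is the PDF's base name without its extension, and that both files are written directly into SAVE_PATH. It makes no assumption about the CSV columns or JSON keys, and the helper names are hypothetical.

```python
import csv
import json
from pathlib import Path


def expected_outputs(pdf_file, save_path):
    """Return the (csv_path, json_path) pair assumed for one input PDF."""
    stem = Path(pdf_file).stem  # e.g. "docs/case1.pdf" -> "case1" (assumed naming)
    out_dir = Path(save_path)
    return out_dir / f"{stem}.csv", out_dir / f"{stem}.json"


def load_outputs(pdf_file, save_path):
    """Load the merged table as a list of rows and the metadata as parsed JSON."""
    csv_path, json_path = expected_outputs(pdf_file, save_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))  # one list of strings per CSV row
    with open(json_path, encoding="utf-8") as f:
        metadata = json.load(f)
    return rows, metadata
```

For example, `rows, metadata = load_outputs("case1.pdf", "output/")` would read `output/case1.csv` and `output/case1.json` under these assumptions.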

About

License:Apache License 2.0

