PACER Docket Parser - Parsing PACER Docket PDF files with ease

This project aims to extract structured information from PACER docket PDF files and store it in the JSON formats specified here.

Installation

Install the docketparser library via:

git clone https://github.com/allenai/pacer-docket-parser.git
cd ./pacer-docket-parser 
pip install -e .
pip install 'git+https://github.com/facebookresearch/detectron2.git#egg=detectron2'

You might find more instructions on Detectron2 installation here.

We use Poppler to render PDF documents as images. The installation method depends on your platform:
1. Mac: brew install poppler
2. Ubuntu: sudo apt-get install -y poppler-utils
3. Windows: See this post
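
To confirm that Poppler is reachable after installation, a quick check along the following lines may help. This is only a sketch: it assumes that finding Poppler's `pdftoppm` command-line tool on your PATH is a good enough signal, and the helper name `poppler_available` is made up for illustration and is not part of docketparser.

```python
import shutil


def poppler_available(binary: str = "pdftoppm") -> bool:
    """Return True if the given Poppler command-line tool is on PATH.

    Hypothetical helper for illustration; not part of docketparser.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if poppler_available():
        print("Poppler found at:", shutil.which("pdftoppm"))
    else:
        print("Poppler not found; install it using the command for your platform.")
```

On Windows, the Poppler `bin` directory typically has to be added to PATH by hand before a check like this succeeds.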

Usage

Docket Table detection and extraction for PDF

Use the following command to extract docket tables from a PDF file:

docketparser parse-all [PDF_FILES] [SAVE_PATH]

For each PDF_FILE, it saves the extracted table as filename.csv and the metadata as filename.json. Please check the example outputs here. For each PDF, we generate a single CSV file, which merges the tables from all pages accordingly.
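
Once the command has run, the two output files can be read back with the Python standard library. The sketch below assumes that the files are written directly into SAVE_PATH and are named after the PDF's file name without its extension. The function name `load_docket_outputs` is hypothetical and not part of docketparser, and no particular CSV column layout or JSON schema is assumed.

```python
import csv
import json
from pathlib import Path


def load_docket_outputs(save_path, stem):
    """Load <stem>.csv and <stem>.json from save_path.

    Hypothetical helper for illustration; not part of docketparser.
    Returns (rows, metadata): rows is a list of lists of strings,
    metadata is whatever the JSON file contains.
    """
    base = Path(save_path)
    # newline="" lets the csv module handle quoted multi-line cells correctly.
    with open(base / f"{stem}.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    with open(base / f"{stem}.json", encoding="utf-8") as f:
        metadata = json.load(f)
    return rows, metadata
```

For example, after parsing a file named `case.pdf` into `out/`, calling `load_docket_outputs("out", "case")` would return the merged table rows and the metadata, provided the naming assumption above holds.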
