This project aims to extract structured information from PACER docket PDF files and store it in the JSON formats specified here.
- Install the docketparser library via:
git clone https://github.com/allenai/pacer-docket-parser.git
cd ./pacer-docket-parser
pip install -e .
pip install 'git+https://github.com/facebookresearch/detectron2.git#egg=detectron2'
You might find more instructions on Detectron2 installation here.
- We use Poppler to render PDF documents as images. The installation method depends on your platform:
- Mac:
brew install poppler
- Ubuntu:
sudo apt-get install -y poppler-utils
- Windows: See this post
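To confirm that Poppler is reachable before parsing, a quick check like the one below may help. It is only a sketch: it assumes that having Poppler's `pdftoppm` command-line tool on your PATH is what matters, and `find_poppler` is a hypothetical helper name, not part of docketparser.

```python
import shutil


def find_poppler(tool="pdftoppm"):
    """Return the full path of a Poppler command-line tool, or None if it is not on PATH.

    Assumption: pdftoppm (shipped with Poppler / poppler-utils) is the tool to look for.
    """
    return shutil.which(tool)


if __name__ == "__main__":
    path = find_poppler()
    if path is None:
        print("Poppler was not found on PATH; install it with one of the commands above.")
    else:
        print(f"Found Poppler tool at: {path}")
```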
Use the following command to extract docket tables from a PDF file:
docketparser parse-all [PDF_FILES] [SAVE_PATH]
It saves the extracted table and the metadata JSON for each PDF_FILE as filename.csv and filename.json, respectively. Please check the exemplar outputs here. For each PDF, we generate a single CSV file, which merges the tables from all of its pages.
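Once parsing has finished, the two output files can be read back with the Python standard library. The snippet below is a minimal sketch under two assumptions: the outputs are written directly inside SAVE_PATH, and they are named after the PDF's base filename. `load_docket_outputs` is a hypothetical helper, not part of docketparser, and it makes no assumption about the columns or JSON keys.

```python
import csv
import json
from pathlib import Path


def load_docket_outputs(pdf_file, save_path):
    """Load the CSV table and JSON metadata produced for one PDF.

    Assumption: outputs live in save_path as <pdf stem>.csv and <pdf stem>.json.
    """
    stem = Path(pdf_file).stem
    csv_path = Path(save_path) / f"{stem}.csv"
    json_path = Path(save_path) / f"{stem}.json"

    # Read the merged docket table as a list of rows (each row is a list of strings).
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

    # Read the metadata as whatever JSON structure the parser wrote.
    with open(json_path, encoding="utf-8") as f:
        metadata = json.load(f)

    return rows, metadata
```

For example, `rows, metadata = load_docket_outputs("docket.pdf", "outputs")` would look for `outputs/docket.csv` and `outputs/docket.json`, where both names are placeholders.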