collawolley/LegalCrawler

Legal Crawler 🐙

A collection of scripts to crawl English legal corpora 📕 from open public domains.

The current version supports the following domains:

Corpus	Domain	Corpus alias
🇪🇺 EU legislation	https://eur-lex.europa.eu/	`eu`
🇬🇧 UK legislation	https://legislation.gov.uk/	`uk`
🇨🇦 Canadian legislation	http://laws.justice.gc.ca/eng/	`ca`
🇯🇵 Japanese legislation	http://www.japaneselawtranslation.go.jp/law/	`jp`
🇫🇮 Finnish legislation	https://www.finlex.fi/en	`fi`
🇺🇸 US case law*	https://case.law/bulk/download/	`us`

* To use the script for US case law, you must first apply for a researcher account at https://case.law.

For US public filings, e.g., contracts, please use the library OpenEDGAR (https://github.com/LexPredict/openedgar) by LexPredict.
Documents are saved in raw text format. Amend the code if you wish to better handle metadata, document structure, etc.

‼️ Disclaimer ‼️

If you aim to use the code, please carefully read the individual license agreements with respect to re-use, re-publication, terms of use, etc. 📝
The text cleansing from the original PDF/HTML files is minimal. Consider amending the scripts and/or writing your own post-processing data cleansing process that better fits each corpus. 🚧
These scripts aim to give researchers a kick start for scraping legal corpora from public domains. They should not be considered a stand-alone qualified solution. 🚧
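
As an illustration of the kind of post-processing mentioned above, here is a minimal sketch of a clean-up function. It is not part of this repository: the name `clean_text` and the specific rules are assumptions, and each rule can damage legitimate content (for example, a line holding only a section number), so adapt it per corpus.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply a few conservative clean-up steps to extracted text.

    Hypothetical helper, not part of LegalCrawler.
    """
    # Normalise Unicode so ligatures and non-breaking spaces become plain characters.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "legis-\nlation" -> "legislation".
    # Caveat: this also joins genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    lines = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Drop lines holding only a short number, assumed to be page numbers.
        if re.fullmatch(r"\d{1,4}", line):
            continue
        lines.append(line)

    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Calling `clean_text` on the contents of each saved document, before any tokenisation or training step, is one way to use it.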

Project Requirements:

Python packages

json-lines
tqdm
beautifulsoup4

Linux packages (command line tools)

The following Linux packages are used to process PDF documents:

pdftocairo
pdftotext
mutool
gs
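
Before starting a long crawl, it can help to confirm that these tools are on the `PATH`. The following is a minimal sketch using only the standard library; the function name `missing_tools` is hypothetical and not part of this repository.

```python
import shutil

# Tool names as listed in this README.
REQUIRED_TOOLS = ["pdftocairo", "pdftotext", "mutool", "gs"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command line tools:", ", ".join(missing))
    else:
        print("All PDF tools found.")
```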

Quick start:

Install Python requirements and Linux packages:

pip install -r requirements.txt

sudo apt-get install libcairo2-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install -y xpdf
sudo apt-get install mupdf mupdf-tools

Download Canadian legislation

python download_legal_corpora.py --corpus ca

Download EU legislation

python download_legal_corpora.py --corpus eu

Download all (EU, UK, CA, FI, JP, US)

python download_legal_corpora.py --corpus all
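
To download a chosen subset of corpora rather than one or all, the command above can be driven from a small wrapper. This is a sketch under assumptions: it relies only on the `--corpus` option and the aliases shown in this README, it assumes the script is run from the repository root, and it assumes the script exits with a non-zero status on failure, which has not been verified. The names `build_command` and `download` are hypothetical.

```python
import subprocess
import sys

# Aliases taken from the corpus table in this README.
CORPUS_ALIASES = ("eu", "uk", "ca", "jp", "fi", "us")


def build_command(alias):
    """Build the command line for one corpus alias (or "all")."""
    if alias != "all" and alias not in CORPUS_ALIASES:
        raise ValueError(f"Unknown corpus alias: {alias!r}")
    return [sys.executable, "download_legal_corpora.py", "--corpus", alias]


def download(aliases):
    """Run the downloads one at a time and return the aliases that failed.

    Assumes a non-zero exit status signals failure.
    """
    failed = []
    for alias in aliases:
        result = subprocess.run(build_command(alias))
        if result.returncode != 0:
            failed.append(alias)
    return failed
```

For example, `download(["eu", "uk"])` would run the two downloads in sequence and return an empty list if both commands exit cleanly.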
