EleutherAI / the-pile

Legal Contracts

hendrycks opened this issue

Here are legal contracts collected from the Securities and Exchange Commission.
https://drive.google.com/file/d/1of37X0hAhECQ3BN_004D8gm6V88tgZaB/view?usp=sharing
It's about 38 GB raw and consists of txt files separated by year. This should hopefully be easy to add to The Pile since we already scraped and processed the data.
Contract review is a large chunk of the legal NLP problem, and it will complement The Pile's Free Law data well.

@hendrycks This is awesome! Thank you so much.

You are welcome to open a PR with code that integrates the downloading and processing of these files. This involves adding a new dataset class to the_pile/datasets.py. If you do this, please push to the version2 branch.
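
For anyone picking this up, here is a rough sketch of what such a dataset class might look like. It is not taken from the_pile/datasets.py: the class name `LegalContractsDataset`, the method names `name` and `documents`, and the assumed layout (one folder per year containing .txt files, already downloaded and extracted) are all hypothetical, so check the actual base class on the version2 branch before adapting it.

```python
from pathlib import Path
from typing import Iterator, Tuple


class LegalContractsDataset:
    """Hypothetical sketch; does not use the real base class in the_pile/datasets.py."""

    def __init__(self, root: str):
        # Assumption: `root` holds the extracted archive, e.g. root/<year>/<file>.txt
        self.root = Path(root)

    def name(self) -> str:
        return "Legal Contracts"

    def documents(self) -> Iterator[Tuple[str, dict]]:
        # Walk every .txt file under root in a stable order.
        for path in sorted(self.root.rglob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            if not text.strip():
                continue  # skip empty or whitespace-only files
            # Assumption: the parent folder name is the filing year.
            yield text, {"year": path.parent.name, "file": path.name}
```

Each yielded item is a `(text, metadata)` pair, so downstream code could filter or sample by year without re-reading the files.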

Otherwise we will get around to this soon. Right now we are busy finishing up a couple of papers, but we'll get to it.