EleutherAI / the-pile

Biodiversity Heritage Library

cfoster0 opened this issue

Language: primarily English, with a few thousand works total in German, French, Spanish, Dutch, Portuguese, and Latin
Date ranges: Primarily pre-1923
Size: Unclear. The collection includes a large number of full-length books, so likely > 1 GB.

The Biodiversity Heritage Library has a very large collection (~250,000) of pre-OCR'd historical books and documents on natural history topics. https://about.biodiversitylibrary.org/tools-and-services/developer-and-data-tools/

The individual .txt file links are listed in the ItemTextURL column of this TSV (warning: this link leads to a 40+MB file) https://www.biodiversitylibrary.org/data/hosted/item.txt
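As a rough sketch of how those links could be pulled out, the snippet below parses tab-separated text and collects the ItemTextURL column. It assumes the file has a header row and uses tabs as the delimiter; the helper name `extract_text_urls`, the `ItemID` column, and the example URLs are made up for illustration, and the real file may need extra handling (encoding, malformed rows).

```python
import csv
import io


def extract_text_urls(tsv_text, column="ItemTextURL"):
    """Return the non-empty values of `column` from tab-separated text.

    Hypothetical helper: assumes the first row is a header that names the column.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    # Skip rows where the column is missing or empty.
    return [row[column] for row in reader if row.get(column)]


# Made-up sample rows; only the ItemTextURL column name comes from the issue.
sample = (
    "ItemID\tItemTextURL\n"
    "1\thttps://example.org/a.txt\n"
    "2\t\n"
    "3\thttps://example.org/b.txt\n"
)
urls = extract_text_urls(sample)
```

In practice the TSV would be downloaded once and its contents passed to the helper, rather than fetching the 40+MB file repeatedly.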

My primary concern is with the quality of the OCR.

I think this would be a phenomenal way to augment our knowledge set. The key question, though, is just how low-quality the OCR is and how much work we should expect the processing to take.
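One cheap way to get a first impression of OCR quality might be to sample a few documents and measure what fraction of their tokens look like ordinary words. The sketch below is only an illustrative heuristic, not an established metric: the function name `wordlike_fraction` and the definition of "word-like" (letters only, optionally joined by apostrophes or hyphens) are assumptions, and any pass/fail threshold would have to be chosen by inspecting real samples.

```python
import re

# A token counts as "word-like" if it is made of letters only, optionally
# joined by internal apostrophes or hyphens (e.g. "don't", "well-known").
# [^\W\d_] matches any Unicode letter, so accented characters are accepted.
_WORD = re.compile(r"^[^\W\d_]+(?:['-][^\W\d_]+)*$")


def wordlike_fraction(text):
    """Return the share of whitespace-separated tokens that look like words."""
    # Strip common surrounding punctuation before testing each token.
    tokens = [t.strip(".,;:!?()[]\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if _WORD.match(t))
    return good / len(tokens)
```

Documents that score low under a heuristic like this could be set aside for manual inspection, which would give a rough sense of how much cleanup the collection needs.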