Biodiversity Heritage Library
cfoster0 opened this issue
Language: primarily English, with a few thousand works total in German, French, Spanish, Dutch, Portuguese, and Latin
Date ranges: Primarily pre-1923
Size: Unclear. It includes a large number of full-length books, so likely > 1 GB.
The Biodiversity Heritage Library has a very large collection (~250,000) of pre-OCR'd historical books and documents on natural history topics. https://about.biodiversitylibrary.org/tools-and-services/developer-and-data-tools/
The individual .txt file links are listed in the ItemTextURL column of this TSV (warning: this link leads to a 40+ MB file): https://www.biodiversitylibrary.org/data/hosted/item.txt
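As a rough illustration of pulling those links out, here is a minimal sketch. The only name taken from the TSV itself is the `ItemTextURL` column; the function names, the `ItemID` column in the sample, and the `example.org` URLs are made up for the example, and I'm assuming the file is tab-separated with a header row and no quoting.

```python
import csv
import io
import urllib.request
from typing import Iterable, Iterator

ITEM_TSV_URL = "https://www.biodiversitylibrary.org/data/hosted/item.txt"


def iter_text_urls(tsv_lines: Iterable[str], column: str = "ItemTextURL") -> Iterator[str]:
    """Yield every non-empty value of `column` from tab-separated lines."""
    # QUOTE_NONE: assume fields are plain tab-separated text, so stray quote
    # characters in titles are not treated as field delimiters.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        url = (row.get(column) or "").strip()
        if url:
            yield url


def stream_remote_tsv(url: str = ITEM_TSV_URL) -> Iterator[str]:
    """Stream the TSV line by line instead of loading the whole file at once."""
    with urllib.request.urlopen(url) as resp:
        # utf-8-sig drops a leading byte-order mark if there is one (assumption).
        text = io.TextIOWrapper(resp, encoding="utf-8-sig", errors="replace", newline="")
        for line in text:
            yield line


if __name__ == "__main__":
    # Offline demo with made-up rows; swap in stream_remote_tsv() for the real file.
    sample = (
        "ItemID\tItemTextURL\n"
        "1\thttps://example.org/a.txt\n"
        "2\t\n"
        "3\thttps://example.org/b.txt\n"
    )
    for text_url in iter_text_urls(io.StringIO(sample)):
        print(text_url)
```

For the real file, `iter_text_urls(stream_remote_tsv())` should walk the rows without holding the full 40+ MB in memory.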
My primary concern is with the quality of the OCR.
I think this would be a phenomenal way to augment our knowledge set. The key question, though, is just how low-quality the OCR is and how much processing work we should expect it to take.
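One cheap way to get a first read on OCR quality would be to sample a few hundred items and compute some crude noise statistics per file. The sketch below is only a heuristic I'm assuming for illustration, not a validated metric: the function name, the "word-like" definition (2 to 20 letters after stripping common punctuation), and the symbol ratio are all made up here, and any thresholds would need to be tuned against files we have inspected by hand.

```python
import re

TOKEN_RE = re.compile(r"\S+")
# Letters only (Unicode-aware), 2 to 20 characters; single letters are treated
# as suspicious because OCR debris often shows up as isolated characters.
WORDLIKE_RE = re.compile(r"^[^\W\d_]{2,20}$")
PUNCT = ".,;:!?()[]\"'"


def ocr_noise_stats(text: str) -> dict:
    """Return crude indicators of OCR noise for one document."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return {"tokens": 0, "wordlike_ratio": 0.0, "symbol_ratio": 0.0}
    # Share of tokens that look like plain words once edge punctuation is removed.
    wordlike = sum(1 for t in tokens if WORDLIKE_RE.match(t.strip(PUNCT)))
    # Share of non-whitespace characters that are neither letters nor digits.
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return {
        "tokens": len(tokens),
        "wordlike_ratio": wordlike / len(tokens),
        "symbol_ratio": symbols / len(non_space),
    }


if __name__ == "__main__":
    clean = "The genus was first described in the original monograph."
    noisy = "Th~ g3nu$ w@s f|rst d3scr1b3d ^^ ~~ ;;"
    print(ocr_noise_stats(clean))
    print(ocr_noise_stats(noisy))
```

Plotting the distribution of `wordlike_ratio` over a random sample should at least tell us whether bad OCR is a long tail we can filter out or a problem across the whole collection. Note that Latin binomials and the non-English works will look fine to this check, since it only tests for letter-only tokens rather than dictionary words.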