This fetches full text from Library of Congress OCR files for LOC items. It
returns the text, when found, and None
otherwise.
It can take as input either a result item from a JSON API response or the URL of an item:
from locr import Fetcher
# From item or resource URL
Fetcher.full_text_from_url('https://www.loc.gov/resource/mss85943.001811/')
# From search result
# See https://libraryofcongress.github.io/data-exploration/requests.html
url = 'https://www.loc.gov/search/?fo=json&fa=subject:cats'
response = requests.get(url)
Fetcher(response['results'][0]).full_text()
Note that the above example is not guaranteed to work. In particular, not all objects have online text available.
Fetcher may raise the following exceptions:
ObjectNotOnline
: when the object does not have any online formats.AmbiguousText
: when multiple fulltext options are found.UnknownFormat
: when locr is not sure how to handle the fulltext link's filetype.
If you encounter these exceptions, kindly file an issue or open a PR about the newly discovered edge case. Thanks.
The Library of Congress has put OCRed full text online for many of its items. However:
- the API does not in general return the URLs to these items
- OCRed text exists on different servers, with different URL formats; there is not one single way to construct the relevant URL for an item
While full text is easy to retrieve via the web site for a single item, perhaps you, like me, would like to fetch it programmatically.
This package has a humiliating lack of tests, and I have done nothing to verify appropriate versions for dependencies. It really can use your help. PRs welcome.