Yves Maurer's repositories
cdx-summarize
Summarize CDX(J) files for MIME-type analysis per second-level domain
cdx-summarize-warc-indexer
Summarize web archive holdings using an existing Solr index
eluxemburgensia-opendata-ark
Get the Archival Resource Keys (ARKs) from the eluxemburgensia.lu public open data set (the text analysis pack)
fasttiffcrop
Crop multiple JPEGs from a single source TIFF in a fast and memory-efficient way
cdx-index-client
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
common-crawl-dl
Download Common Crawl data for some top-level domains
fixit_tiff
Fixes some issues in (potentially) baseline TIFFs
langchain
⚡ Building applications with LLMs through composability ⚡
mets-export-illustrations
Export illustrations from METS files alongside metadata
speller-ocr-eval
Evaluate OCR correctness by identifying the language and then running a spell checker
warcnet-cdx-summarize-analysis
Generate reports from cdx-summarize files