Tim Allison's repositories
commoncrawl-fetcher-lite
Simplified version of a Common Crawl fetcher
file-observatory
Single-server/laptop-grade file observatory
tika-gui-v2
Unofficial user interface for Apache Tika
SimpleCommonCrawlExtractor
Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries
awesome-digital-preservation
Carefully curated list of awesome digital preservation resources.
hodgepodge
One-off dev repo, very experimental
language-detector
Language Detection Library for Java
tika-addons
Addons not part of the official Tika release
any23
Apache Anything To Triples (Any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents.
commons-compress
Mirror of Apache Commons Compress
commons-io
Apache Commons IO
incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
metadata-extractor
Extracts Exif, IPTC, XMP, ICC and other metadata from image files
opensearch-java
Java Client for OpenSearch
poi
Mirror of Apache POI
tika-arlington-pdf-model
Simple wrapper around the Arlington PDF model's TestGrammar
tika-detector-stormcrawler
Wraps the charset detection logic from StormCrawler as a Tika module
tika-docker
Convenience Docker images for Apache Tika Server
tika-eval-multi-comparer
Demo tika-eval-multi-comparer