DigitalPebble Ltd's repositories
TextClassification
A Text Classification API in Java, originally developed by DigitalPebble Ltd. The API is independent of the underlying ML implementations and can serve as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
textclassification-examples
Use cases for DigitalPebble's TextClassification API
stormcrawlerfight
Crawl configurations for benchmarking / testing StormCrawler
ansible-storm
Ansible playbook for deploying a Storm cluster
stormcrawler-docker
Resources for running StormCrawler with Docker services
TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API
behemoth-commoncrawl
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
ngrams-api
Java API for querying an N-grams corpus. Uses Lucene to index and search data in the Google Web-1T format
NutchFight
Resources for comparing versions 1.8 and 2.x of Apache Nutch
behemoth-elasticsearch
Elasticsearch module for Behemoth
behemoth-textclassification
Module for classifying Behemoth documents with a model from our Text Classification API
crawlurlfrontier
Crawl configuration used to test URL Frontier at large scale and produce WARCs for CommonCrawl
urlfrontier-client
URL Frontier client written in Rust (mostly as a way of learning Rust)
benchmark
StormCrawler topology to evaluate the performance of different backends and configurations
digitalpebble.github.io
Resources for the DigitalPebble website
tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
tika-detector-stormcrawler
Wraps the charset detection logic from StormCrawler as a Tika module