DigitalPebble Ltd

DigitalPebble Ltd's repositories

behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Language: Java | License: NOASSERTION | 282 44 42

A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

Language: Java | License: Apache-2.0 | 48 15 1

textclassification-examples

Use cases for DigitalPebble's TextClassification API

Language: Java | License: Apache-2.0 | 10 20

stormcrawlerfight

Crawl configurations for benchmarking / testing StormCrawler

Language: Shell | License: Apache-2.0 | 9 30

ansible-storm

Ansible playbook for deploying a Storm cluster

7 5 1

stormcrawler-docker

Resources for running StormCrawler with Docker services

Language: Dockerfile | License: Apache-2.0 | 6 4 1

TextClassificationPlugin

GATE Processing Resource wrapping DigitalPebble's TextClassification API

Language: Java | 5 9 1

behemoth-commoncrawl

Support for the old (pre-2013) CommonCrawl dataset in Behemoth

Language: Java | 4 60

crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java | License: Apache-2.0 | 4 60

ngrams-api

Java API for querying an N-grams corpus. Uses Lucene to index and search data in the Google Web-1T format.

Language: Java | License: NOASSERTION | 4 20

NutchFight

Resources for comparing versions 1.8 and 2.x of Apache Nutch

Language: Java | License: Apache-2.0 | 4 30

tescobank

Setup for crawling tescobank with StormCrawler (SC)

Language: Java | License: Apache-2.0 | 4 40

sc-warc

WARC resources for StormCrawler

2 3 11

behemoth-elasticsearch

Elasticsearch module for Behemoth

Language: Java | 1 50

behemoth-textclassification

Module for classifying Behemoth documents with a model from our Text Classification API

Language: Java | 1 20

crawlurlfrontier

Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

Language: FLUX | 1 30

nutch

Apache Nutch is an extensible and scalable web crawler

Language: Java | License: Apache-2.0 | 1 10

urlfrontier-client

URLFrontier client written in Rust (mostly as a way of learning Rust)

Language: Rust | License: Apache-2.0 | 100

benchmark

StormCrawler topology to evaluate the performance of different backends and configurations

Language: Shell | 000

crawler4j-frontier-battle

Language: Java | 020

digitalpebble.github.io

Resources for the DigitalPebble website

Language: SCSS | 020

docs

Documentation for Docker Official Images in docker-library

Language: Shell | License: MIT | 010

storm

Mirror of Apache Storm

Language: Java | License: Apache-2.0 | 030

tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Language: Java | License: Apache-2.0 | 000

tika-cc

Resources for generating a corpus of documents from CommonCrawl (CC) for Tika

Language: Shell | 030

tika-detector-stormcrawler

Wraps the charset detection logic from StormCrawler as a Tika module

Language: Java | License: Apache-2.0 | 000