DigitalPebble Ltd (DigitalPebble)

Location: Bristol, UK

Home Page: http://www.digitalpebble.com

DigitalPebble Ltd's repositories

behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Language: Java | License: NOASSERTION | Stargazers: 282 | Issues: 44 | Issues: 42

TextClassification

A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

Language: Java | License: Apache-2.0 | Stargazers: 48 | Issues: 15 | Issues: 1

textclassification-examples

Use cases for DigitalPebble's TextClassification API

Language: Java | License: Apache-2.0 | Stargazers: 10 | Issues: 2 | Issues: 0

stormcrawlerfight

Crawl configurations for benchmarking / testing StormCrawler

Language: Shell | License: Apache-2.0 | Stargazers: 9 | Issues: 3 | Issues: 0

ansible-storm

Ansible playbook for deploying a Storm cluster

stormcrawler-docker

Resources for running StormCrawler with Docker services

Language: Dockerfile | License: Apache-2.0 | Stargazers: 6 | Issues: 4 | Issues: 1

TextClassificationPlugin

GATE Processing Resource wrapping DigitalPebble's TextClassification API

behemoth-commoncrawl

Support for the old (pre-2013) CommonCrawl dataset in Behemoth

Language: Java | Stargazers: 4 | Issues: 6 | Issues: 0

crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java | License: Apache-2.0 | Stargazers: 4 | Issues: 6 | Issues: 0

ngrams-api

Java API for querying an N-grams corpus. Uses Lucene for indexing and searching data in the Google Web-1T format.

Language: Java | License: NOASSERTION | Stargazers: 4 | Issues: 2 | Issues: 0

NutchFight

Resources for comparing versions 1.8 and 2.x of Apache Nutch

Language: Java | License: Apache-2.0 | Stargazers: 4 | Issues: 3 | Issues: 0

tescobank

Setup for crawling tescobank with StormCrawler (SC)

Language: Java | License: Apache-2.0 | Stargazers: 4 | Issues: 4 | Issues: 0

sc-warc

WARC resources for StormCrawler

behemoth-elasticsearch

ElasticSearch module for Behemoth

Language: Java | Stargazers: 1 | Issues: 5 | Issues: 0

behemoth-textclassification

Module for classifying Behemoth documents with a model from our Text Classification API

Language: Java | Stargazers: 1 | Issues: 2 | Issues: 0

crawlurlfrontier

Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

Language: FLUX | Stargazers: 1 | Issues: 3 | Issues: 0

nutch

Apache Nutch is an extensible and scalable web crawler

Language: Java | License: Apache-2.0 | Stargazers: 1 | Issues: 1 | Issues: 0

urlfrontier-client

URLFrontier client written in Rust (mostly as a way of learning Rust)

Language: Rust | License: Apache-2.0 | Stargazers: 1 | Issues: 0 | Issues: 0

benchmark

StormCrawler topology to evaluate the performance of different backends and configurations

Language: Shell | Stargazers: 0 | Issues: 0 | Issues: 0
Language: Java | Stargazers: 0 | Issues: 2 | Issues: 0

digitalpebble.github.io

Resources for the DigitalPebble website

Language: SCSS | Stargazers: 0 | Issues: 2 | Issues: 0

docs

Documentation for Docker Official Images in docker-library

Language: Shell | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

storm

Mirror of Apache Storm

Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 3 | Issues: 0

tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 0 | Issues: 0

tika-cc

Resources for generating a corpus of docs from CC for Tika

Language: Shell | Stargazers: 0 | Issues: 3 | Issues: 0

tika-detector-stormcrawler

Wraps the charset detection logic from StormCrawler as a Tika module

Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 0 | Issues: 0