Scale Unlimited

Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing workflows on various cluster computing platforms. Please see https://github.com/cwensel/cascading for access to all WIP branches.

License: NOASSERTION

cascading.classify

Linear SVM for Cascading-based workflows

Language: Java, License: Apache-2.0

cascading.cuke

Integration of Cucumber with Cascading

Language: Java, License: Apache-2.0

cascading.lucene

Cascading 2.0 Scheme for writing out Lucene indexes using Tuple field values.

Language: Java

cucumber-jvm

Cucumber for the JVM

License: MIT

fastText

Library for fast text representation and classification.

Language: HTML, License: NOASSERTION

flink-crawler-ccdemo

Demo of using flink-crawler to extract pages from Common Crawl for a target language

Language: Java, License: Apache-2.0

flink-multisource

Classes that wrap multiple source functions in useful ways

License: Apache-2.0

flink-utils

Utilities for use with Flink

Language: Java, License: Apache-2.0

fse4j

Java port of the FiniteStateEntropy project on GitHub (https://github.com/Cyan4973/FiniteStateEntropy)

License: Apache-2.0

http-fetcher

Wrapper code for Apache HttpClient that provides common page fetching functionality

Language: Java, License: Apache-2.0

JFastText

Java interface for fastText

Language: Java, License: NOASSERTION

lucene-solr

Mirror of Apache Lucene + Solr

Language: Java

pinot

Apache Pinot (Incubating) - A realtime distributed OLAP datastore

Language: Java, License: Apache-2.0

tenaya

Tenaya is code that processes FASTQ files from the Sequence Read Archive (SRA), and identifies reads with bad metadata (e.g. wrong species) and/or bad read data.

Language: Java, License: Apache-2.0

wikiwords

Code to create a mapping from words to Wikipedia article titles (topics) and categories

Language: Java, License: Apache-2.0

yahoo-streaming-benchmark

An extension of Yahoo's Benchmarks

Language: Java, License: Apache-2.0