Scale Unlimited's repositories
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
cascading.solr
Cascading scheme for Solr
cascading.utils
Utilities for Cascading
cascading.avro
Cascading Scheme for the Apache Avro data serialization format
cascading.simpledb
Cascading Tap & Scheme for Amazon's SimpleDB
wikipedia-ngrams
Code to split and parse a Wikipedia XML dump
text-similarity
Source code for blog post series on text features for similarity calculation
flink-streaming-kmeans
Simple implementation of KMeans clustering on Flink, using iterations
liblinear-java
Java version of LIBLINEAR
cascading.snippets
Snippets of useful Cascading code.
ec2instances.info
Amazon EC2 instance comparison site
scaleunlimited.github.com
Maven repo for Java components that aren't in a public Maven repo.
atomizer
Cascading-based workflow to process noisy record-based data
cascading
Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing workflows on various cluster computing platforms. Please see https://github.com/cwensel/cascading for access to all WIP branches.
cascading.classify
Linear SVM for Cascading-based workflows
cascading.cuke
Integration of Cucumber with Cascading
cascading.lucene
Cascading 2.0 Scheme for writing out Lucene indexes using Tuple field values.
cucumber-jvm
Cucumber for the JVM
fastText
Library for fast text representation and classification.
flink-crawler-ccdemo
Demo of using flink-crawler to extract pages from Common Crawl for a target language
flink-multisource
Classes that wrap multiple source functions in useful ways
flink-utils
Utilities for use with Flink
fse4j
Java port of the FiniteStateEntropy project on GitHub (https://github.com/Cyan4973/FiniteStateEntropy)
http-fetcher
Wrapper code for Apache HttpClient that provides common page fetching functionality
JFastText
Java interface for fastText
lucene-solr
Mirror of Apache Lucene + Solr
pinot
Apache Pinot (Incubating) - A realtime distributed OLAP datastore
tenaya
Tenaya is code that processes FASTQ files from the Sequence Read Archive (SRA), and identifies reads with bad metadata (e.g. wrong species) and/or bad read data.
wikiwords
Code to create a mapping from words to Wikipedia article titles (topics) and categories
yahoo-streaming-benchmark
An extension of Yahoo's Benchmarks