Marco Didonna's starred repositories
wikihadoop
Stream-based InputFormat for processing the compressed XML dumps of Wikipedia with Hadoop
HadoopPerceptron
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36266.pdf
twitter_nlp
Twitter NLP Tools
ark-tweet-nlp
CMU ARK Twitter Part-of-Speech Tagger
MongoReduce
Hadoop Input and Output formats for MongoDB
cascading.solr
Cascading scheme for Solr
python-snappy
Python bindings for Google's Snappy compression library
Pig-scripting-examples
Examples of using Pig's scripting-language capabilities
grouperfish
Text clustering service for the web
elephantdb
Distributed database specialized in exporting key/value data from Hadoop
elephant-bird
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
FileSetInputFormat
A Hadoop input format for sending lists of files as keys to a mapper. Set the list of files, and an input split will be created per file. Each map task gets only one input key: the filename for its split.
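The one-split-per-file behavior described above can be illustrated with a small plain-Python simulation. This is only a sketch of the idea, not the FileSetInputFormat API: it does not use Hadoop at all, and the names make_splits, run_map_tasks, and describe are hypothetical helpers invented for this illustration.

```python
from typing import Callable, Iterable, List


def make_splits(files: Iterable[str]) -> List[str]:
    """Create one 'split' per file, preserving order (hypothetical helper)."""
    return list(files)


def run_map_tasks(files: Iterable[str], mapper: Callable[[str], str]) -> List[str]:
    """Simulate one map task per split.

    Each simulated task receives exactly one input key:
    the filename of its split.
    """
    results = []
    for filename in make_splits(files):
        results.append(mapper(filename))
    return results


def describe(filename: str) -> str:
    """Example mapper: report which file this task was handed."""
    return "processing " + filename


if __name__ == "__main__":
    for line in run_map_tasks(["a.log", "b.log", "c.log"], describe):
        print(line)
```

In this sketch, three input files yield three splits and therefore three mapper calls, each of which sees only its own filename. What the mapper then does with that file (open it, copy it, index it) is left to the job.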