dvryaboy's repositories
idl_storage_guidelines
This document attempts to capture useful patterns and warn about subtle gotchas when it comes to designing and evolving schemas for long-term serialized data. It is not intended as a guide for how to best represent a particular dataset or process.
elephant-bird
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, and HBase code.
piglatin-mode
PigLatin mode for Emacs.
elephant-twin
Elephant Twin is a framework for creating indexes in Hadoop
elephant-twin-lzo
Elephant Twin LZO uses Elephant Twin to create LZO block indexes
Vertica-Hadoop-Connector
Vertica Hadoop Connector
awesome-bigdata
A curated list of awesome big data frameworks, resources and other awesomeness.
flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
hadoop-lzo
Patched, refactored version of code.google.com/hadoop-gpl-compression for Hadoop 0.20
apache-proposal
Apache Incubator Proposal for Parquet Format
cascading
Cascading is a feature-rich API for defining and executing complex and fault-tolerant data processing workflows on a Hadoop cluster.
gitbook
The GitBook documentation for Aqueduct
Impatient
Source examples to support the "Cascading for the Impatient" blog post series
incubator-parquet-format
Mirror of Apache Parquet
incubator-parquet-mr
Mirror of Apache Parquet
lakeFS
lakeFS - Data version control for your data lake | Git for data
MassQueryLanguage
The Mass Spec Query Language (MassQL) is a domain specific language meant to be a succinct way to express a query in a mass spectrometry centric fashion.
parquet-format-1
As we are moving to Apache, please open your pull requests on: https://github.com/apache/incubator-parquet-format
pdi-google-spreadsheet-plugin
Plugin for Pentaho Data Integration allowing reading and writing of Google Spreadsheets
redelm
an anagram
scalding
A Scala API for Cascading
semantic-versioning
Java library relying on semver.org principles to check binary code compatibility