badele / Data-Infra-Projects

List of some interesting projects

Data-Infra-Projects

This is an attempt to list out all the interesting projects.

It is intended for anyone who designs modern large-scale architectures and needs to choose tools, technologies, or frameworks. The goal is to support those choices with resources such as comparisons, use cases, feature lists, and maturity assessments, or anything else that helps in making an informed decision.

TODO: Add links and licenses.

##Abstractions

##Distributed Coordination

These are implementations and libraries that help in writing distributed applications that require some form of coordination.

##Infrastructure Management

#####comparisons

##File Systems

##Distributed Databases

##Infrastructure Logging/Monitoring

##Infrastructure Helpers

MultiCloud/CrossCloud utilities

##Virtualization

##Virtualization++

##Generalized Data Processing

#####comparisons

  • Tez vs Dryad
  • Hadoop vs Spark - Too many differences, no good link.

##Large-scale Distributed ML

##pub-sub / messaging

##Data Ingest

##Graph Storing and/or Processing

##SQL Engines

##Stream Processing

##Security

##Performance Analysis

##Workflow engines/DAG-executors/Pipelines

#####Comparisons

##Configuration Management

##Service Discovery

#####Comparison

##Testing

##Visualization

##Libraries

  • Zoie
  • Norbert - cluster manager and networking layer built on top of Zookeeper.
  • Okapi - Large-scale ML & graph analytics on Giraph
  • Scalding - A Scala API for Cascading
  • SummingBird - Streaming MapReduce with Scalding and Storm
  • Curator - set of Java libraries that make using Apache ZooKeeper much easier
  • Turbine - Low latency high throughput aggregator for real time streams
  • DataFu - Collection of MapReduce libraries
  • Twill (previously known as Weave) - library for writing YARN applications

##Search

##Others

  • Nutch - web crawler
  • Ambari - Hadoop Deployment + Management
  • Bigtop - Hadoop Packaging
  • Skuld
  • Camus - LinkedIn's Kafka to HDFS pipeline.
  • Kiji - collect, analyze and serve data in real time on Apache Hadoop and HBase
