Awesome Data Engineering

A curated list of data engineering tools for software developers

Serialization Formats

- Apache Avro: a data serialization system.
- Apache Parquet: a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
- Apache Thrift: a software framework for scalable cross-language services development.
- ProtoBuf: Protocol Buffers, Google's data interchange format.
- SequenceFile: a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format.

Stream Processing

- Spark Streaming: makes it easy to build scalable, fault-tolerant streaming applications.
- Apache Flink: a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- Apache Storm: a free and open source distributed realtime computation system.
- Apache Samza: a distributed stream processing framework.
- Apache NiFi: an easy-to-use, powerful, and reliable system to process and distribute data.

Batch Processing

- [Hadoop MapReduce](http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html): a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- [Spark](https://spark.apache.org/)
  - Spark Packages: a community index of packages for Apache Spark.
  - Deep Spark: connects Apache Spark with different data stores.
- [AWS EMR](http://aws.amazon.com/elasticmapreduce/)
- Flink
- [Tez](https://tez.apache.org/)

Charts and Dashboards

- [Flask](http://flask.pocoo.org/)
- [D3](http://d3js.org/)
  - [D3Plus](http://d3plus.org): D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
- [AngularJS](https://angularjs.org/)
- [Django](https://www.djangoproject.com/)
- [Highcharts](http://www.highcharts.com/)
- C3.js: a D3-based reusable chart library.

Datasets

- [GitHub Archive](https://www.githubarchive.org/): GitHub's public timeline since 2011, updated every hour.
- [Common Crawl](https://commoncrawl.org/): an open source repository of web crawl data.

Inspired by the awesome list. Created by Insight Data Engineering fellows.