A curated list of data engineering tools for software developers
List of content
- [Databases](#databases)
- Ingestion
- [File System](#file-system)
- File Format
- Stream Processing
- [Batch Processing](#batch-processing)
- [Front End](#front-end)

Databases

- Relational
  - [MySQL](http://www.mysql.com/)
  - [PostgreSQL](http://www.postgresql.org/)
  - [Amazon RDS](http://aws.amazon.com/rds/)
- Key-Value
  - [Redis](http://redis.io/)
  - [Riak](https://docs.basho.com/riak/latest/)
  - [AWS DynamoDB](http://aws.amazon.com/dynamodb/)
- Column
  - [Cassandra](http://cassandra.apache.org/)
  - [HBase](http://hbase.apache.org/)
  - [AWS Redshift](http://aws.amazon.com/redshift/)
- Document
  - [MongoDB](https://www.mongodb.org/)
  - [Elasticsearch](https://www.elastic.co/)
  - [Couchbase](http://www.couchbase.com/)
- Graph
  - [Neo4j](http://neo4j.com/)
  - [OrientDB](http://orientdb.com/orientdb/)
  - [ArangoDB](https://www.arangodb.com/)
  - [Titan](http://thinkaurelius.github.io/titan/)

Ingestion

- [Kafka](http://kafka.apache.org/)
  - Camus: LinkedIn's Kafka to HDFS pipeline
  - BottledWater: Change data capture from PostgreSQL into Kafka
  - kafkat: Simplified command-line administration for Kafka brokers
  - kafkacat: Generic command-line non-JVM Apache Kafka producer and consumer
  - pg-kafka: A PostgreSQL extension to produce messages to Apache Kafka
  - librdkafka: The Apache Kafka C/C++ library
  - kafka-docker: Kafka in Docker
  - kafka-manager: A tool for managing Apache Kafka
  - kafka-node: Node.js client for Apache Kafka 0.8
  - [Secor](https://github.com/pinterest/secor) Pinterest's Kafka to S3 distributed consumer
- [AWS Kinesis](http://aws.amazon.com/kinesis/)
- RabbitMQ
- FluentD
- Apache Sqoop

File System

- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
- [AWS S3](http://aws.amazon.com/s3/)
- [Tachyon](http://tachyon-project.org/)

File Format

- Apache Avro: Apache Avro™ is a data serialization system.
- Apache Parquet: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
- Apache Thrift: The Apache Thrift software framework, for scalable cross-language services development.
- ProtoBuf: Protocol Buffers, Google's data interchange format.
- SequenceFile: SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format.

Stream Processing

- Spark Streaming: Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
- Apache Flink: Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- Apache Storm: Apache Storm is a free and open source distributed realtime computation system.
- Apache Samza: Apache Samza is a distributed stream processing framework.
- Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data

Batch Processing

- [Hadoop MapReduce](http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
- [Spark](https://spark.apache.org/)
  - Spark Packages: A community index of packages for Apache Spark
  - Deep Spark: Connecting Apache Spark with different data stores
- [AWS EMR](http://aws.amazon.com/elasticmapreduce/)
- Flink
- [Tez](https://tez.apache.org/)
- Batch ML
  - [H2O](http://h2o.ai/)
  - [Mahout](http://mahout.apache.org/)
  - [Spark MLlib](https://spark.apache.org/docs/1.2.1/mllib-guide.html)
- Batch Graph
  - [GraphLab](https://dato.com/products/create/)
  - [Giraph](http://giraph.apache.org/)
  - [Spark GraphX](https://spark.apache.org/graphx/)
- Batch SQL
  - [Presto](https://prestodb.io/docs/current/index.html)
  - [Hive](http://hive.apache.org)
  - [Drill](https://drill.apache.org/)

Front End

- [Flask](http://flask.pocoo.org/)
- [D3](http://d3js.org/)
- [D3Plus](http://d3plus.org) D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
- [AngularJS](https://angularjs.org/)
- [Django](https://www.djangoproject.com/)
- [Highcharts](http://www.highcharts.com/)
- C3.js: D3-based reusable chart library
- Flocker: Easily manage Docker containers & their data
- [GitHub Archive](https://www.githubarchive.org/) GitHub's public timeline since 2011, updated every hour
- [Common Crawl](https://commoncrawl.org/) Open source repository of web crawl data
Cheers to The Data Engineering Ecosystem: An Interactive Map
Inspired by the awesome list. Created by Insight Data Engineering fellows.