Sharing interesting and noteworthy Data Engineering content - namely blogs, podcasts, repos, books, videos, and MOOCs. This was mostly curated by and for Fellows in the Insight Data Engineering Fellows Program, and inspired by the repo of one of our Fellows, Igor Barinov.
If you have ideas or other interesting resources, feel free to open an Issue or Pull Request.
Technologies
All technologies are listed alphabetically in their given section.
Overviews
File Formats
Avro
ORCFiles
Parquet
Protocol Buffers
Thrift
File Systems
Hadoop Distributed File System (HDFS)
S3
Blogs
- Excellent summary of the history of Hadoop by Marko Bonaci. This post is also read as a podcast by Software Engineering Daily.
Databases
Overviews
- Jepsen - Kyle Kingsbury's (Aphyr) guide on distributed systems and databases, and how they fail.
Relational Databases
MySQL
Postgres
Key-Value Databases
Redis
Riak
Column-Family Databases
Accumulo
Cassandra
Blogs
- Nice post about using `clustering order by` in Cassandra
- Post by DataStax about the basics of data modeling in Cassandra
HBase
Graph Databases
Neo4j
OrientDB
Search Tools
Elasticsearch
Lucene
Solr
General Batch Processing
Hadoop MapReduce
Blogs
- Excellent summary of the history of Hadoop by Marko Bonaci. This post is also read as a podcast by Software Engineering Daily.
Hadoop Abstractions
Cascalog
Cascading
Hadoop Streaming / mrjob
Hive
Pig
Scalding
Spark
Graph Processing
Giraph
GraphLab Create
Spark GraphX
Machine Learning Tools
FlinkML
H2O
Mahout
Spark MLlib
Stream Processing
Flink
Slides
- Strata Talk by Kostas Tzoumas on Flink Streaming's capabilities.
- Streaming Benchmark talk by Jamie Grier on extending Yahoo's Benchmark, based off this blog
Blogs
- Asynchronous Snapshots Blog by Data Artisans, and a summary in The Morning Paper
Papers
- MillWheel Paper, which discusses Low Watermarks for Exactly-Once Semantics
- Asynchronous Snapshots Barrier Paper describing Flink's snapshot algorithm
- Chandy-Lamport Paper on Distributed Snapshots, and a summary in The Morning Paper
NiFi
Samza
Spark Streaming
Storm
Ingestion Tools
Flume
Logstash
Messaging Queues / PubSub
Kafka
Blogs
- Part 1 of a series of 3 blogs on how Datadog monitors Kafka. Part 1 is an especially good intro to Kafka's architecture.
Podcasts
Videos
- [Video](https://www.youtube.com/watch?v=aJuo_bLSW6s&feature=youtu.be) by Jay Kreps on logs, stream processing and Kafka
RabbitMQ
ZeroMQ
Workflow and Scheduling
Airflow
Podcasts
- Interview with Maxime Beauchemin on Airflow, Airpal, and Caravel on Software Engineering Daily.
Azkaban
Luigi
Oozie
Cluster Management and Coordination
Docker
Kubernetes
Mesos
YARN
Zookeeper
Important Algorithms and Theorems
- List of 100 Seminal Data Engineering Papers from Anil Madan
Distributed Systems
- General Notes from Kyle Kingsbury (Aphyr) on Distributed Systems
Paxos
- Visualization of Paxos with explanation
Raft
MapReduce
Distributed graph and machine learning algorithms
Papers
- [Paper](https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf) on using z-values for implementing approximate k-nearest neighbors in a MapReduce framework. There is also a Background paper on the topic, describing the non-distributed version.
- [Paper](http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf) on sPCA -- Scalable Principal Component Analysis
Gossip Protocol
Chandy-Lamport
- Chandy-Lamport Paper on Distributed Snapshots, and a summary in The Morning Paper
Load Balancing
Transactions
CAP Theorem
- Blog on nuances of the CAP theorem by Nicolas Liochon
Background and Interview Prep
- Repo of awesome computer science courses.
General Guidance for Interviews
- Excellent post from TripleByte on preparing for interviews, both technically and strategically
Data Structures and Algorithms
MOOCs
Books
- Cracking the Coding Interview, with solutions in many languages here
Practice Websites
SQL and Database Design
MOOCs
- Jennifer Widom's self-paced MOOC from first principles, based off her Stanford course.
Practice Websites
System Design
- Repo of many system design studies, resources, and strategies.
Software Engineering Best Practices
Programming Languages
Learning Linux Commands
Operating Systems and Networking
- Excellent Review of Fair Scheduling in Linux from The Morning Paper.
- Blog from the Localytics engineering team on the impact of saving CPU cycles when processing billions of records, and the effects of CPU tuning.