tiredoftools / Awesome-Data-Engineering-Content

Sharing interesting and noteworthy Data Engineering content

Sharing interesting and noteworthy Data Engineering content - namely blogs, podcasts, repos, books, videos, and MOOCs. This was mostly curated by and for Fellows in the Insight Data Engineering Fellows Program, and inspired by the repo of one of our Fellows, Igor Barinov.

If you have ideas or other interesting resources, feel free to open an Issue or Pull Request.

Table of Contents

Technologies

All technologies are listed alphabetically in their given section.

Overviews

File Formats

Avro

ORCFiles

Parquet

Protocol Buffers

Thrift

File Systems

Hadoop Distributed File System (HDFS)

S3

Blogs

Databases

Overviews

  • Jepsen - Kyle Kingsbury's (Aphyr) guide on distributed systems and databases, and how they fail.

Relational Databases

MySQL

Postgres

Key-Value Databases

Redis

Riak

Column-Family Databases

Accumulo

Cassandra

Blogs
  • Nice post about using CLUSTERING ORDER BY in Cassandra
  • Post by DataStax about the basics of data modeling in Cassandra

HBase

Graph Databases

Neo4j

OrientDB

Search Tools

Elasticsearch

Lucene

Solr

General Batch Processing

Hadoop MapReduce

Blogs

Hadoop Abstractions

Cascalog

Cascading

Hadoop Streaming / mrjob

Hive

Pig

Scalding

Spark

Graph Processing

Giraph

GraphLab Create

Spark GraphX

Machine Learning Tools

FlinkML

H2O

Mahout

Spark MLlib

Stream Processing

Flink

Slides

Blogs

Papers

NiFi

Samza

Spark Streaming

Storm

Ingestion Tools

Flume

Logstash

Messaging Queues / PubSub

Kafka

Blogs

  • Part 1 and Part 2 of Jay Kreps's posts on streams in Kafka

  • Part 1 of a three-part blog series on how Datadog monitors Kafka. Part 1 is an especially good intro to Kafka's architecture.

Podcasts

Videos

RabbitMQ

ZeroMQ

Workflow and Scheduling

Airflow

Podcasts

  • Interview with Maxime Beauchemin on Airflow, Airpal, and Caravel on Software Engineering Daily.

Azkaban

Luigi

Oozie

Cluster Management and Coordination

Docker

Kubernetes

Mesos

YARN

Zookeeper

Important Algorithms and Theorems

Distributed Systems

  • General notes from Kyle Kingsbury (Aphyr) on distributed systems

Paxos

Raft

MapReduce

Distributed graph and machine learning algorithms

Papers

Gossip Protocol

Chandy-Lamport

Load Balancing

Transactions

CAP Theorem

  • Blog on nuances of the CAP theorem by Nicolas Liochon

Background and Interview Prep

  • Repo of awesome computer science courses.

General Guidance for Interviews

  • Excellent post from TripleByte on preparing for interviews, both technically and strategically.

Data Structures and Algorithms

MOOCs

  • Part 1 and Part 2 of Tim Roughgarden's MOOC, based on his Stanford course.

Books

Practice Websites

SQL and Database Design

MOOCs

Practice Websites

System Design

  • Repo of many system design studies, resources, and strategies.

Software Engineering Best Practices

Programming Languages

Learning Linux Commands

Operating Systems and Networking

  • Excellent review of fair scheduling in Linux from The Morning Paper.
  • Blog from the Localytics engineering team on the impact of saving CPU cycles when processing billions of records, and on the effects of CPU tuning.
