nave91 / awesome-data-engineering

A curated list of data engineering tools for software developers

Awesome Data Engineering

List of content

  1. Databases
  2. Data Ingestion
  3. File System
  4. File Format
  5. Stream Processing
  6. Batch Processing
  7. Front End

Databases

Data Ingestion

File System

File Format

  • Apache Avro: Apache Avro™ is a data serialization system.
  • Apache Parquet: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
  • Apache Thrift: The Apache Thrift software framework, for scalable cross-language services development.
  • ProtoBuf: Protocol Buffers, Google's data interchange format.
  • SequenceFile: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.
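
To make the "columnar storage" idea behind formats such as Parquet concrete, here is a minimal plain-Python sketch. It is illustrative only: the sample records and the `to_columns` helper are hypothetical, and nothing here uses the Parquet file format or any Parquet library.

```python
# Illustrative only: contrasts row-oriented and column-oriented layouts
# with plain Python structures. This is NOT the Parquet format or API.

# Hypothetical sample records, stored row by row.
rows = [
    {"user": "ann", "clicks": 3, "country": "DE"},
    {"user": "bob", "clicks": 7, "country": "US"},
    {"user": "cy", "clicks": 5, "country": "DE"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing one field walks every full record.
total_from_rows = sum(record["clicks"] for record in rows)

# Columnar layout: the same sum reads a single list of values.
total_from_columns = sum(columns["clicks"])

print(columns["clicks"])
print(total_from_rows == total_from_columns)
```

Keeping each field's values together is what lets a columnar reader scan only the columns a query needs; real formats add encoding, compression, and metadata on top of this idea.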

Stream Processing

  • Spark Streaming: Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
  • Apache Flink: Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
  • Apache Storm: Apache Storm is a free and open source distributed realtime computation system.
  • Apache Samza: Apache Samza is a distributed stream processing framework.
  • Apache NiFi: Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data.

Batch Processing

Front End

ELK (Elasticsearch, Logstash, Kibana)

Docker

  • Flocker: Easily manage Docker containers and their data.

Datasets

Realtime

Data Dumps

Cheers to The Data Engineering Ecosystem: An Interactive Map

Inspired by the awesome list. Created by Insight Data Engineering fellows.
