Awesome OLAP
A curated list of awesome open-source Online Analytical Processing databases, frameworks, resources, tools and other awesomeness, for data engineers.
- OLAP Databases
- Data lake
- Brokers and distributed messaging
- Ingestion and querying
- Scheduler
- ETL, ELT and reverse ETL
- Datasets
- Benchmark
- Readings
- People to follow
- Events
- Communities
OLAP Databases
Real-time analytics
The following columnar databases use a shared-nothing architecture and provide a sub-second response time. DDL, DML and DCL are operated via SQL. These databases also support tiering for long-term cold storage.
Search engines
- Elasticsearch - Search and analytics engine based on Apache Lucene.
- Meilisearch - Open-source search engine that aims to be a ready-to-go solution.
- OpenSearch - Apache 2.0 fork of Elasticsearch.
- Quickwit - Search engine on top of object storage, using shared-everything architecture.
- Typesense - Open-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.
NewSQL
- Citus - PostgreSQL compatible distributed table.
- TiDB - MySQL compatible SQL database that supports hybrid transactional and analytical processing workloads.
Timeseries
- Grafana Mimir - Prometheus compatible TSDB on top of object storage.
- TimescaleDB - PostgreSQL compatible TSDB.
Managed cloud services
Data lake
The data lake approach (or "lakehouse") is a semi-structured schema that sits on top of object storage in the cloud.
It is composed of a few layers (from lower to higher level): codec, file format, table format + metastore, and the ingestion/query layer.
File formats and serialization
These formats are popular for shared-everything databases that use object storage as the persistence layer. Data is organized in rows or columns, with a strict schema definition. The files are immutable and support partial reads (headers only, metadata, a single data page, etc.); any mutation requires uploading a new file. Most formats support nested schemas, codecs, compression and data encryption. Indexes can be added to file metadata for faster processing.
A single file usually weighs between tens of MB and a few GB. Many small files require more merge operations, while larger files are costly to update.
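The payoff of columnar layout is partial reads: a reader can fetch only the bytes of the columns it needs. A minimal pure-Python sketch (not any real format, just the layout idea, using toy `(user_id, score)` records):

```python
import struct

# Toy dataset: (user_id, score) records.
records = [(1, 0.5), (2, 0.9), (3, 0.1)]

# Row-oriented layout (like Avro): records stored contiguously,
# so reading one field still scans every record.
row_file = b"".join(struct.pack("<id", uid, score) for uid, score in records)

# Column-oriented layout (like Parquet/ORC): each column is a contiguous
# block, so a reader can fetch only the column it needs.
ids_col = struct.pack(f"<{len(records)}i", *(uid for uid, _ in records))
scores_col = struct.pack(f"<{len(records)}d", *(s for _, s in records))
col_file = ids_col + scores_col

def read_scores_columnar(buf: bytes, n: int) -> list[float]:
    """Partial read: skip the id column entirely and decode only scores."""
    offset = n * 4  # the id column is n 4-byte ints
    return list(struct.unpack_from(f"<{n}d", buf, offset))

print(read_scores_columnar(col_file, len(records)))  # [0.5, 0.9, 0.1]
```

Real formats add per-column metadata (offsets, statistics) so readers can locate a column without knowing the row count in advance.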
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
Open table formats
Open table formats are an abstraction layer on top of Avro/Parquet files, with support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution and mutations. The schema and statistics are stored in a metastore, while the data is persisted locally or in remote/cloud object storage.
Open tables make a cost-effective data warehouse at petabyte scale.
Comparison:
- (2022) https://medium.com/geekculture/open-table-formats-delta-iceberg-hudi-732f682ec0bb
- (2023) https://aws.amazon.com/blogs/big-data/choosing-an-open-table-format-for-your-transactional-data-lake-on-aws/
👆 Warning: pre-2022 articles should be considered out of date, as open table formats are evolving quickly.
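The core mechanism shared by these formats is an append-only transaction log of metadata actions over immutable data files (Delta Lake's `_delta_log`, Iceberg's snapshots). A heavily simplified pure-Python sketch of the idea, with hypothetical file names:

```python
import json

# Append-only transaction log: one JSON entry per commit.
log: list[str] = []

def commit(actions: list[dict]) -> int:
    """Atomically append one commit; returns the new table version."""
    log.append(json.dumps(actions))
    return len(log) - 1

def snapshot(version: int) -> set[str]:
    """Replay the log up to `version` to get the set of live data files.
    Reading an old version is "time travel"."""
    files: set[str] = set()
    for entry in log[: version + 1]:
        for action in json.loads(entry):
            if action["op"] == "add":
                files.add(action["path"])
            elif action["op"] == "remove":
                files.discard(action["path"])
    return files

commit([{"op": "add", "path": "part-0.parquet"}])            # version 0
commit([{"op": "add", "path": "part-1.parquet"}])            # version 1
commit([{"op": "remove", "path": "part-0.parquet"},          # version 2:
        {"op": "add", "path": "part-0-compacted.parquet"}])  # compaction

print(sorted(snapshot(1)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(snapshot(2)))  # ['part-0-compacted.parquet', 'part-1.parquet']
```

Because data files are never modified in place, a commit is a single atomic log append, which is how these formats get ACID semantics on top of object storage.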
Metastore
- AWS Glue
- Databricks Unity Catalog
- Hive Metastore - Component of Hadoop HiveServer2, that can be used standalone.
- Nessie
Object Storage
- Apache HDFS - Hadoop distributed file system.
- AWS S3
- Azure Blob Storage
- GCP Cloud Storage
- Minio - S3 compatible and self-hosted object storage.
Codecs, encoding and compression
Brokers and distributed messaging
Ingestion and querying
Stream processing
Process a set of data in real time (or near real time), as it is being generated.
- Apache Beam - Unified SDK for cross language stream processing. Available in Go, Python, Java, Scala and Typescript.
- Apache Flink - Stateful stream processing.
- Apache Kafka Streams - Stream processing library on top of Kafka.
- Apache Spark Streaming - Stream processing on top of Spark.
- Akka Streams - Stream processing.
- Benthos - Go stream processing.
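A core operation these engines provide is windowed aggregation: grouping an unbounded event stream into fixed time windows. A minimal pure-Python sketch of tumbling-window counting (assumes in-order events, no lateness or watermark handling):

```python
from collections import defaultdict
from typing import Iterable, Iterator

def tumbling_window_counts(
    events: Iterable[tuple[int, str]], window_ms: int
) -> Iterator[tuple[int, dict[str, int]]]:
    """Group (timestamp_ms, key) events into fixed windows and count keys,
    emitting each window as soon as an event from a later window arrives."""
    current = None
    counts: dict[str, int] = defaultdict(int)
    for ts, key in events:
        window = ts - ts % window_ms  # window start timestamp
        if current is not None and window != current:
            yield current, dict(counts)  # close the previous window
            counts.clear()
        current = window
        counts[key] += 1
    if current is not None:
        yield current, dict(counts)  # flush the last open window

events = [(1000, "click"), (1500, "view"), (1800, "click"), (2100, "click")]
print(list(tumbling_window_counts(events, 1000)))
# [(1000, {'click': 2, 'view': 1}), (2000, {'click': 1})]
```

Production engines add what this sketch omits: event-time watermarks, late-event handling, and fault-tolerant state.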
Batch processing
Process periodically a large amount of data in a single batch.
In-memory processing
Non real-time SQL queries against a large database can be processed locally. The dataset might not fit into memory, or the job might take very long to complete.
- Apache Arrow - Low-level in-memory data processing. Zero-copy data manipulation for any language, via gRPC/IPC interfaces.
- Apache Arrow Datafusion - High level SQL interface for Apache Arrow.
- delta-rs - Standalone DeltaLake driver for Python and Rust. Does not depend on Spark.
- Delta Standalone - Standalone DeltaLake driver for Java and Scala. Does not depend on Spark.
- DuckDB - In-process SQL query engine for Parquet, CSV and other files. Integrates with Apache Arrow.
- Pandas - Python data analysis and manipulation tool.
- clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet, etc. files.
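The pattern all in-process engines share: embed the engine in the application, no server to deploy. A sketch using stdlib `sqlite3` as a stand-in for an analytical engine like DuckDB (the API shape is similar):

```python
import sqlite3

# In-process engine: the database lives inside the application process.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# A typical analytical aggregation, executed locally with no server.
rows = con.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 15.0), (2, 7.5)]
```

The difference in practice: DuckDB executes such queries with a vectorized, columnar engine and can query Parquet files directly, while sqlite3 is row-oriented and OLTP-focused.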
Distributed SQL processing
These engines distribute SQL query processing over very large databases across a cluster. They support ANSI SQL.
- Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
- ksql - SQL interface for Kafka.
- PrestoDB - Distributed SQL query engine.
- Trino - Distributed SQL query engine. Fork of PrestoDB.
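A common execution strategy in these engines is scatter-gather: each worker computes a partial aggregate over its data partition, and a coordinator merges the partials. A pure-Python sketch for `SELECT key, COUNT(*) ... GROUP BY key`:

```python
from collections import Counter

# Each inner list stands for the rows held by one worker's partition.
partitions = [
    ["a", "b", "a"],   # worker 1
    ["b", "c"],        # worker 2
    ["a", "c", "c"],   # worker 3
]

# Scatter: each worker counts its own partition (runs in parallel in
# a real engine, sequentially here).
partials = [Counter(rows) for rows in partitions]

# Gather: the coordinator merges the partial counts.
merged = sum(partials, Counter())
print(sorted(merged.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

This works because COUNT, SUM, MIN and MAX are decomposable; aggregates like exact COUNT(DISTINCT) need a shuffle or sketches such as HyperLogLog instead.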
Scheduler
These tools orchestrate, schedule and monitor repetitive data transformations, organized as workflows.
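These workflows are typically modeled as DAGs: a task runs only after all of its upstream dependencies have succeeded. The scheduling core reduces to a topological sort, sketched here with stdlib `graphlib` and a hypothetical pipeline:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract -> transform -> {load, quality_check}.
# Each key depends on the tasks in its value set.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "quality_check": {"transform"},
}

# A valid execution order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['extract', 'transform', 'load', 'quality_check']
```

Real schedulers add the parts this omits: cron-style triggers, retries, backfills, and parallel execution of independent tasks (`load` and `quality_check` could run concurrently).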
ETL, ELT and reverse ETL
ETL is the popular acronym for Extracting, Transforming and Loading data. ELT performs data transformations directly within the data warehouse. Reverse ETL is the process of copying data from your data warehouse to external tools or SaaS.
- Airbyte - ELT.
- Census - Reverse ETL.
- RudderStack - Customer Data Platform. Pipeline between a tracking plan, event transformation, and destination tools (data warehouse or SaaS).
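The three steps can be shown end to end in a few lines. A toy sketch with stdlib modules only (inline CSV as the source, `sqlite3` standing in for the warehouse, all names hypothetical):

```python
import csv
import io
import sqlite3

raw = "user,amount\nalice,10\nbob,-3\ncarol,5\n"

# Extract: pull rows out of the source system.
extracted = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalize names, cast types, drop invalid rows.
transformed = [
    (row["user"].title(), float(row["amount"]))
    for row in extracted
    if float(row["amount"]) > 0
]

# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (user TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

print(warehouse.execute("SELECT user, amount FROM sales ORDER BY user").fetchall())
# [('Alice', 10.0), ('Carol', 5.0)]
```

In ELT the transform step would instead run as SQL inside the warehouse after loading the raw rows; reverse ETL runs the pipeline in the other direction, from the warehouse out to SaaS tools.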
Datasets
- awesome-public-datasets
- CommonCrawl
- Criteo
- Entso-e
- GitHub Archives
- Kaggle - Community-sourced datasets.
- NYC Taxi
Benchmark
- Jepsen - Distributed databases, queues and consensus protocols testing.
- TPC family benchmarks - For big data databases.
Readings
Papers
Architecture
Data modeling
Index
- Partitioning
- Data skipping
- Statistics
- High cardinality
- HyperLogLog
- Bloom filters
- Minmax
- Z-ordering
- Bitmap index
- Dense index
- Sparse index
- Reverse index
- N-gram
- TF-IDF
- LSM Tree
Vector similarity search
Vectorized query processing
- Apache Arrow vectorized execution
- Apache Arrow SIMD parallel processing
- Cockroach vectorized JOIN
- Latency comparison numbers
Querying
Transactions
Consensus
Challenging platforms
Blogs to follow
- Engineering at Meta
- Engineering at Criteo
- Engineering at Uber
- Engineering at Airbnb
- Databricks
- Towards Data Science
More
- https://www.moderndatastack.xyz/
- https://books.japila.pl/
- https://jepsen.io/analyses
- https://github.com/aphyr/distsys-class
People to follow
// TODO
Events
// TODO
Communities
// TODO
🤝 Contributing
Contributions of any kind welcome, just follow the guidelines!
👤 Contributors
💫 Show your support
Give a ⭐️ if this project helped you!
📝 License
Copyright © 2023 Samuel Berthe.
This project is MIT licensed.