Awesome OLAP
A curated list of awesome open-source Online Analytical Processing databases, frameworks, resources, tools and other awesomeness, for data engineers.
- OLAP Databases
- Data lake
- Brokers and distributed messaging
- Ingestion and querying
- Scheduler
- ETL, ELT and reverse ETL
- Datasets
- Benchmark
- Readings
- People to follow
- Events
- Communities
OLAP Databases
Real-time analytics
The following columnar databases use a shared-nothing architecture and provide a sub-second response time. DDL, DML and DCL are operated via SQL. These databases also support tiering for long-term cold storage.
Search engines
- Elasticsearch - Search and analytics engine based on Apache Lucene.
- Meilisearch - Open-source search engine that aims to be a ready-to-go solution.
- OpenSearch - Apache 2.0 fork of Elasticsearch.
- Quickwit - Search engine on top of object storage, using shared-everything architecture.
- Typesense - Open-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.
NewSQL
- Citus - PostgreSQL compatible distributed table.
- TiDB - MySQL compatible SQL database that supports hybrid transactional and analytical processing workloads.
Timeseries
- Grafana Mimir - Prometheus compatible TSDB on top of object storage.
- TimescaleDB - PostgreSQL compatible TSDB.
Managed cloud services
Data lake
The data lake approach (or "lakehouse") is a semi-structured schema that sits on top of object storage in the cloud.
It is composed of a few layers (from lower to higher level): codec, file format, table format + metastore, and the ingestion/query layer.
File formats and serialization
These formats are popular for shared-everything databases that use object storage as the persistence layer. Data is organized in rows or columns, with a strict schema definition. The files are immutable and support partial reads (headers only, metadata, a single data page, etc.); any mutation requires uploading a new file. Most formats support nested schemas, codecs, compression and data encryption. Indexes can be added to file metadata for faster processing.
A single file usually weighs between tens of MB and a few GB. Many small files require more merge operations, while larger files are costly to update.
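The payoff of columnar layout is partial reads: a reader can fetch only the bytes of the columns it needs. A minimal pure-Python sketch (not any real format, just the layout idea, using toy `(user_id, score)` records):

```python
import struct

# Toy dataset: (user_id, score) records.
records = [(1, 0.5), (2, 0.9), (3, 0.1)]

# Row-oriented layout (like Avro): records stored contiguously,
# so reading one field still scans every record.
row_file = b"".join(struct.pack("<id", uid, score) for uid, score in records)

# Column-oriented layout (like Parquet/ORC): each column is a contiguous
# block, so a reader can fetch only the column it needs.
ids_col = struct.pack(f"<{len(records)}i", *(uid for uid, _ in records))
scores_col = struct.pack(f"<{len(records)}d", *(s for _, s in records))
col_file = ids_col + scores_col

def read_scores_columnar(buf: bytes, n: int) -> list[float]:
    """Partial read: skip the id column entirely and decode only scores."""
    offset = n * 4  # the id column is n 4-byte ints
    return list(struct.unpack_from(f"<{n}d", buf, offset))

print(read_scores_columnar(col_file, len(records)))  # [0.5, 0.9, 0.1]
```

Real formats add per-column metadata (offsets, statistics) so readers can locate a column without knowing the row count in advance.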
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
Open table formats
Open table formats are an abstraction layer on top of Avro/Parquet files, with support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution and mutations. The schema and statistics are stored in a metastore, while the data is persisted locally or in remote/cloud object storage.
Open tables make a cost-effective data warehouse at petabyte scale.
Comparison:
- (2022) https://medium.com/geekculture/open-table-formats-delta-iceberg-hudi-732f682ec0bb
- (2023) https://aws.amazon.com/blogs/big-data/choosing-an-open-table-format-for-your-transactional-data-lake-on-aws/
👆 Warning: pre-2022 articles should be considered out of date, as open table formats are evolving quickly.
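The core mechanism shared by these formats is an append-only transaction log of metadata actions over immutable data files (Delta Lake's `_delta_log`, Iceberg's snapshots). A heavily simplified pure-Python sketch of the idea, with hypothetical file names:

```python
import json

# Append-only transaction log: one JSON entry per commit.
log: list[str] = []

def commit(actions: list[dict]) -> int:
    """Atomically append one commit; returns the new table version."""
    log.append(json.dumps(actions))
    return len(log) - 1

def snapshot(version: int) -> set[str]:
    """Replay the log up to `version` to get the set of live data files.
    Reading an old version is "time travel"."""
    files: set[str] = set()
    for entry in log[: version + 1]:
        for action in json.loads(entry):
            if action["op"] == "add":
                files.add(action["path"])
            elif action["op"] == "remove":
                files.discard(action["path"])
    return files

commit([{"op": "add", "path": "part-0.parquet"}])            # version 0
commit([{"op": "add", "path": "part-1.parquet"}])            # version 1
commit([{"op": "remove", "path": "part-0.parquet"},          # version 2:
        {"op": "add", "path": "part-0-compacted.parquet"}])  # compaction

print(sorted(snapshot(1)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(snapshot(2)))  # ['part-0-compacted.parquet', 'part-1.parquet']
```

Because data files are never modified in place, a commit is a single atomic log append, which is how these formats get ACID semantics on top of object storage.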
Metastore
- AWS Glue
- Databricks Unity Catalog
- Hive Metastore - Component of Hadoop HiveServer2, that can be used standalone.
- Nessie
Object Storage
- Apache HDFS - Hadoop distributed file system.
- AWS S3
- Azure Blob Storage
- GCP Cloud Storage
- Minio - S3 compatible and self-hosted object storage.
Codecs, encoding and compression
Brokers and distributed messaging
Ingestion and querying
Stream processing
Process a set of data in real time (or near real time), as it is being generated.
- Apache Beam - Unified SDK for cross language stream processing. Available in Go, Python, Java, Scala and Typescript.
- Apache Flink - Stateful stream processing.
- Apache Kafka Streams - Stream processing library on top of Kafka.
- Apache Spark Streaming - Stream processing on top of Spark.
- Akka Streams - Stream processing.
- Benthos - Go stream processing.
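A core operation these engines provide is windowed aggregation: grouping an unbounded event stream into fixed time windows. A minimal pure-Python sketch of tumbling-window counting (assumes in-order events, no lateness or watermark handling):

```python
from collections import defaultdict
from typing import Iterable, Iterator

def tumbling_window_counts(
    events: Iterable[tuple[int, str]], window_ms: int
) -> Iterator[tuple[int, dict[str, int]]]:
    """Group (timestamp_ms, key) events into fixed windows and count keys,
    emitting each window as soon as an event from a later window arrives."""
    current = None
    counts: dict[str, int] = defaultdict(int)
    for ts, key in events:
        window = ts - ts % window_ms  # window start timestamp
        if current is not None and window != current:
            yield current, dict(counts)  # close the previous window
            counts.clear()
        current = window
        counts[key] += 1
    if current is not None:
        yield current, dict(counts)  # flush the last open window

events = [(1000, "click"), (1500, "view"), (1800, "click"), (2100, "click")]
print(list(tumbling_window_counts(events, 1000)))
# [(1000, {'click': 2, 'view': 1}), (2000, {'click': 1})]
```

Production engines add what this sketch omits: event-time watermarks, late-event handling, and fault-tolerant state.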
Batch processing
Process periodically a large amount of data in a single batch.
In-memory processing
Non real-time SQL queries against a large database can be processed locally. The dataset might not fit into memory, or the job might take very long to complete.
- Apache Arrow - Low-level in-memory data processing. Zero-copy data manipulation for any language, via gRPC/IPC interfaces.
- Apache Arrow Datafusion - High level SQL interface for Apache Arrow.
- delta-rs - Standalone DeltaLake driver for Python and Rust. Does not depend on Spark.
- Delta Standalone - Standalone DeltaLake driver for Java and Scala. Does not depend on Spark.
- DuckDB - In-process SQL query engine for Parquet, CSV and other files. Integrates with Apache Arrow.
- Pandas - Python data analysis and manipulation tool.
- clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet, etc. files.
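The pattern all in-process engines share: embed the engine in the application, no server to deploy. A sketch using stdlib `sqlite3` as a stand-in for an analytical engine like DuckDB (the API shape is similar):

```python
import sqlite3

# In-process engine: the database lives inside the application process.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# A typical analytical aggregation, executed locally with no server.
rows = con.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 15.0), (2, 7.5)]
```

The difference in practice: DuckDB executes such queries with a vectorized, columnar engine and can query Parquet files directly, while sqlite3 is row-oriented and OLTP-focused.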
Distributed SQL processing
These engines distribute SQL query processing over very large databases across a cluster. They support ANSI SQL.
- Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
- ksql - SQL interface for Kafka.
- PrestoDB - Distributed SQL query engine.
- Trino - Distributed SQL query engine. Fork of PrestoDB.
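A common execution strategy in these engines is scatter-gather: each worker computes a partial aggregate over its data partition, and a coordinator merges the partials. A pure-Python sketch for `SELECT key, COUNT(*) ... GROUP BY key`:

```python
from collections import Counter

# Each inner list stands for the rows held by one worker's partition.
partitions = [
    ["a", "b", "a"],   # worker 1
    ["b", "c"],        # worker 2
    ["a", "c", "c"],   # worker 3
]

# Scatter: each worker counts its own partition (runs in parallel in
# a real engine, sequentially here).
partials = [Counter(rows) for rows in partitions]

# Gather: the coordinator merges the partial counts.
merged = sum(partials, Counter())
print(sorted(merged.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

This works because COUNT, SUM, MIN and MAX are decomposable; aggregates like exact COUNT(DISTINCT) need a shuffle or sketches such as HyperLogLog instead.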
Scheduler
These tools orchestrate, schedule and monitor repetitive data transformations, organized as workflows.
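These workflows are typically modeled as DAGs: a task runs only after all of its upstream dependencies have succeeded. The scheduling core reduces to a topological sort, sketched here with stdlib `graphlib` and a hypothetical pipeline:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract -> transform -> {load, quality_check}.
# Each key depends on the tasks in its value set.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "quality_check": {"transform"},
}

# A valid execution order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['extract', 'transform', 'load', 'quality_check']
```

Real schedulers add the parts this omits: cron-style triggers, retries, backfills, and parallel execution of independent tasks (`load` and `quality_check` could run concurrently).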
ETL, ELT and reverse ETL
ETL is the popular acronym for Extracting, Transforming and Loading data. ELT performs data transformations directly within the data warehouse. Reverse ETL is the process of copying data from your data warehouse to external tools or SaaS.
- Airbyte - ELT.
- Census - Reverse ETL.
- RudderStack - Customer Data Platform. Pipeline between a tracking plan, event transformation, and destination tools (data warehouse or SaaS).
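The three steps can be shown end to end in a few lines. A toy sketch with stdlib modules only (inline CSV as the source, `sqlite3` standing in for the warehouse, all names hypothetical):

```python
import csv
import io
import sqlite3

raw = "user,amount\nalice,10\nbob,-3\ncarol,5\n"

# Extract: pull rows out of the source system.
extracted = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalize names, cast types, drop invalid rows.
transformed = [
    (row["user"].title(), float(row["amount"]))
    for row in extracted
    if float(row["amount"]) > 0
]

# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (user TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

print(warehouse.execute("SELECT user, amount FROM sales ORDER BY user").fetchall())
# [('Alice', 10.0), ('Carol', 5.0)]
```

In ELT the transform step would instead run as SQL inside the warehouse after loading the raw rows; reverse ETL runs the pipeline in the other direction, from the warehouse out to SaaS tools.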
Datasets
- awesome-public-datasets
- CommonCrawl
- Criteo
- Entso-e
- GitHub Archives
- Kaggle - Community-sourced datasets.
- NYC Taxi
Benchmark
- Jepsen - Distributed databases, queues and consensus protocols testing.
- TPC family benchmarks - For big data databases.
Readings
Papers
Architecture
Data modeling
Index
- Partitioning
- Data skipping
- Statistics
- High cardinality
- HyperLogLog
- Bloom filters
- Minmax
- Z-ordering
- Bitmap index
- Dense index
- Sparse index
- Reverse index
- N-gram
- TF-IDF
- LSM Tree
Vector similarity search
Vectorized query processing
- Apache Arrow vectorized execution
- Apache Arrow SIMD parallel processing
- Cockroach vectorized JOIN
- Latency comparison numbers
Querying
Transactions
Consensus
Challenging platforms
Blogs to follow
- Engineering at Meta
- Engineering at Criteo
- Engineering at Uber
- Engineering at Airbnb
- Databricks
- Towards Data Science
More
- https://www.moderndatastack.xyz/
- https://books.japila.pl/
- https://jepsen.io/analyses
- https://github.com/aphyr/distsys-class
People to follow
// TODO
Events
// TODO
Communities
// TODO
🤝 Contributing
Contributions of any kind welcome, just follow the guidelines!
👤 Contributors
💫 Show your support
Give a ⭐️ if this project helped you!
📝 License
Copyright © 2023 Samuel Berthe.
This project is MIT licensed.