yangliuyu / awesome-modern-bigdata

A list of awesome modern big data libraries, frameworks and platforms.

awesome-modern-bigdata

Computing

  • Flink Stateful Computations over Data Streams.
  • Spark Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Orchestration

  • NiFi Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
  • StreamPipes A self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams.

Ingestion

  • Debezium Debezium is an open source distributed platform for change data capture.
  • Flink CDC CDC Connectors for Apache Flink® is a set of source connectors for Apache Flink® that ingest changes from different databases using change data capture (CDC).

File Storage

  • MinIO MinIO offers high-performance, S3-compatible object storage.
  • JuiceFS JuiceFS is a high-performance shared file system designed for cloud-native use and released under the Apache License 2.0. It provides full POSIX compatibility, allowing almost all kinds of object storage to be used locally as massive local disks and to be mounted and read on different cross-platform and cross-region hosts at the same time.
  • Fluid Fluid is an open source Kubernetes-native Distributed Dataset Orchestrator and Accelerator for data-intensive applications, such as big data and AI applications.
  • Alluxio Alluxio provides data orchestration for analytics and machine learning in the cloud.

OLAP Query Engine

  • Presto Presto is a distributed SQL query engine for big data.

Messaging

  • Kafka Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
  • Pulsar Apache Pulsar is a cloud-native, distributed messaging and streaming platform originally created at Yahoo! and now a top-level Apache Software Foundation project.

Database

  • ClickHouse ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing (OLAP) of queries.
  • StarRocks StarRocks is a next-gen sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics and ad-hoc query.
  • TiKV TiKV is an open-source, distributed, and transactional key-value database. Unlike other traditional NoSQL systems, TiKV not only provides classical key-value APIs, but also transactional APIs with ACID compliance.

Data Lake

  • Iceberg Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to safely work with the same tables, at the same time.
  • Hudi Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database layer, while being optimized for lake engines and regular batch processing.
  • Delta Lake Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.
  • Flink Table Store Flink Table Store is a unified streaming and batch store for building dynamic tables on Apache Flink.

Metadata

  • DataHub DataHub is an open-source metadata platform for the modern data stack.
  • Amundsen Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Data Quality

Network

Monitoring

Data Analytics

  • Zeppelin Zeppelin, a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Data Visualization

  • Superset Apache Superset is a modern data exploration and visualization platform.
  • Davinci Davinci is a DVaaS (Data Visualization as a Service) platform aimed at product managers, business users, data engineers, data analysts, and data scientists.
  • DataEase DataEase is an open source data visualization analysis tool that helps users quickly analyze data and gain insight into business trends, so as to achieve business improvement and optimization.

License: MIT License