There are 350 repositories under the big-data topic.
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Apache Spark - A unified analytics engine for large-scale data processing
ClickHouse® is a free analytics DBMS for big data
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
An open source cybersecurity protocol for syncing decentralized graph data.
The official home of the Presto distributed SQL query engine for big data
PredictionIO, a machine learning server for developers and ML engineers.
The Data Engineering Cookbook
An open source time-series database for fast ingest and SQL queries
A distributed, fast open-source graph database featuring horizontal scalability and high availability
The most widely used Python to C compiler
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Apache Beam is a unified programming model for Batch and Streaming data processing.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Open-Source Web UI for Apache Kafka Management
Data-Centric Pipelines and Data Versioning
Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
Arkime (formerly Moloch) is an open source, large scale, full packet capturing, indexing, and database system.
Open-source distributed computation and storage platform.
Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
StarRocks is a next-gen sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
The open big data serving engine. https://vespa.ai
Feature Store for Machine Learning
Simple and Distributed Machine Learning
⚡️ A Vue component that supports large data lists with high, efficient render performance.
Apache Arrow DataFusion SQL Query Engine