big-data

There are 437 repositories under the big-data topic.

binhnguyennus / awesome-scalability
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
architecture awesome awesome-list backend big-data computer-science design-patterns devops distributed-systems interview interview-practice interview-questions lists machine-learning programming resources scalability system system-design web-development
66576
ClickHouse / ClickHouse
ClickHouse® is a real-time analytics database management system
ai analytics big-data clickhouse cloud-native cpp database dbms distributed embedded hacktoberfest lakehouse mpp olap rust self-hosted sql
Language:C++ 43878
apache / spark
Apache Spark - A unified analytics engine for large-scale data processing
python scala r java big-data jdbc sql spark
Language:Scala 42240
donnemartin / data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano
Language:Python 28640
apache / flink
Apache Flink
scala java big-data flink python sql
Language:Java 25451
amark / gun
An open source cybersecurity protocol for syncing decentralized graph data.
artificial-intelligence big-data blockchain crdt crypto cryptography dapp database decentralized dweb encryption end-to-end graph machine-learning metaverse offline-first p2p protocol realtime web3
Language:JavaScript 18746
heibaiying / BigData-Notes
A Beginner's Guide to Big Data :star:
hadoop hdfs yarn mapreduce hive spark storm hbase scala kafka zookeeper flume azkaban sqoop phoenix bigdata big-data
Language:Java 16734
prestodb / presto
The official home of the Presto distributed SQL query engine for big data
big-data data hadoop hive java lakehouse presto query sql
Language:Java 16558
andkret / Cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Language:Python 14613
apache / predictionio
PredictionIO, a machine learning server for developers and ML engineers.
big-data predictionio scala
Language:Scala 12529
trinodb / trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
java presto hive hadoop big-data sql prestodb database databases distributed-systems distributed-database data-science datalake jdbc query-engine trino analytics delta-lake iceberg
Language:Java 12119
yahoo / CMAK
CMAK is a tool for managing Apache Kafka clusters
kafka scala cluster-management big-data
Language:Scala 11937
vesoft-inc / nebula
A distributed, fast open-source graph database featuring horizontal scalability and high availability
graph-database distributed database graphdb raft cpp nebula-graph nebula graph nebulagraph big-data distributed-systems scalability hacktoberfest
Language:C++ 11799
provectus / kafka-ui
Open-Source Web UI for Apache Kafka Management
apache-kafka big-data cluster-management event-streaming hacktoberfest kafka kafka-brokers kafka-client kafka-cluster kafka-connect kafka-manager kafka-producer kafka-streams kafka-ui opensource streaming-data streams web-ui
Language:Java 11548
StarRocks / starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Language:Java 10847
quickwit-oss / quickwit
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
big-data cloud-native cloud-storage distributed-tracing log-management logs open-source rust search-engine tantivy
Language:Rust 10516
cython / cython
The most widely used Python to C compiler
python cython cpython cpython-extensions c cpp performance big-data
Language:Python 10428
catboost / catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
machine-learning decision-trees gradient-boosting gbm gbdt python r kaggle gpu-computing catboost tutorial categorical-features gpu coreml data-science big-data cuda data-mining
Language:C++ 8645
delta-io / delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
spark acid big-data analytics delta-lake
Language:Scala 8386
apache / beam
Apache Beam is a unified programming model for Batch and Streaming data processing.
batch beam big-data golang java python sql streaming
Language:Java 8361
apache / datafusion
Apache DataFusion SQL Query Engine
arrow big-data dataframe datafusion olap python query-engine rust sql
Language:Rust 7991
h2oai / h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark
Language:Jupyter Notebook 7356
arkime / arkime
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
big-data c javascript network-monitoring nsm packet-capture pcap security
Language:JavaScript 7174
apache / couchdb
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
big-data cloud content couchdb database erlang http javascript network-client network-server
Language:Erlang 6719
apache / zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
scala java big-data zeppelin javascript spark flink database nosql
Language:Java 6577
vespa-engine / vespa
AI + Data, online. https://vespa.ai
ai big-data java machine-learning rag search search-engine server serving-recommendation tensor vector vector-database vector-search vespa
Language:Java 6559
hazelcast / hazelcast
Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
java hazelcast in-memory big-data scalability distributed caching hacktoberfest stream-processing low-latency distributed-computing distributed-systems data-in-motion data-insights real-time
Language:Java 6461
feast-dev / feast
The Open Source Feature Store for AI/ML
big-data data-engineering data-quality data-science feature-store features machine-learning ml mlops python
Language:Python 6459
pachyderm / pachyderm
Data-Centric Pipelines and Data Versioning
analytics big-data containers data-analysis data-science distributed-systems docker go kubernetes pachyderm
Language:Go 6263
apache / iotdb
Apache IoTDB
big-data database iot java nosql timeseries tsdb
Language:Java 6225
apache / hive
Apache Hive
java hive database sql apache big-data hadoop
Language:Java 5865
microsoft / SynapseML
Simple and Distributed Machine Learning
spark pyspark azure scala microsoft ml machine-learning databricks cognitive-services lightgbm http model-deployment deep-learning ai apache-spark data-science synapse big-data onnx opencv
Language:Scala 5175
apache / ignite
Apache Ignite
big-data cache cloud data-management-platform database distributed-sql-database hadoop ignite in-memory-computing in-memory-database iot network-client network-server osgi sql
Language:Java 5001
apache / calcite
Apache Calcite
geospatial calcite java big-data hadoop sql
Language:Java 4977
tschellenbach / Stream-Framework
Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
activity-stream cassandra feed news news-feed big-data redis activity-feed
Language:Python 4736
Eventual-Inc / Daft
Distributed query engine providing simple and reliable data processing for any modality and scale
artificial-intelligence big-data data-engineering distributed-computing machine-learning multimodal python rust
Language:Rust 4677
