There are 392 repositories under the spark topic.
Apache Spark - A unified analytics engine for large-scale data processing
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Learn and understand Docker and container technologies, with real DevOps practice!
Free Data Engineering course!
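Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large-scale production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "Big Data Real-Time Computing Engine: Flink in Practice and Performance Optimization".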
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a small modular C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
List of Data Science Cheatsheets to rule the world
Apache Doris is an easy-to-use, high-performance, unified analytics database.
A Flexible and Powerful Parameter Server for large-scale machine learning
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Alluxio, data orchestration for analytics and machine learning in the cloud
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.
Simple and Distributed Machine Learning
Fast, distributed, secure AI for Big Data
PipelineAI Kubeflow Distribution
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
The Hunting ELK
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Python SQL Parser and Transpiler
State of the Art Natural Language Processing
Koalas: pandas API on Apache Spark
🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; the project is simple (heavy on logic, light on UI) and well suited for practice.
Interactive and Reactive Data Science using Scala and Spark.
A better compressed bitset in Java
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
REST job server for Apache Spark
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Python clone of Spark, a MapReduce-like framework in Python