The following repositories fall under the data-lake topic.
Apache Kyuubi is a distributed, multi-tenant gateway that provides serverless SQL on data warehouses and lakehouses.
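Because Kyuubi exposes a HiveServer2-compatible endpoint, a generic Hive client can submit SQL to it. A minimal sketch with PyHive follows; the host, user, and table name are placeholders, and 10009 is assumed as the default frontend port.

```python
# Minimal sketch: querying a Kyuubi gateway through its HiveServer2-compatible
# endpoint with PyHive. Host, username and the table name are placeholder
# assumptions for illustration, not values from the project above.
from pyhive import hive

conn = hive.Connection(host="kyuubi.example.com", port=10009, username="analyst")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM lakehouse_db.events")  # hypothetical table
print(cursor.fetchall())
cursor.close()
conn.close()
```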
A few data engineering projects covering data modeling, cloud infrastructure setup, data warehousing, and data lake development.
BitSail is a distributed, high-performance data integration engine that supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of records every day.
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Apache Amoro (incubating) is a lakehouse management system built on open data lake formats.
Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST catalog written in Rust.
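Since Lakekeeper implements the Iceberg REST Catalog protocol, any REST-capable Iceberg client should be able to talk to it. A minimal sketch with PyIceberg, where the endpoint URI, warehouse name, and port are placeholder assumptions about a local deployment:

```python
# Minimal sketch: connecting to an Iceberg REST catalog (such as Lakekeeper)
# with PyIceberg. URI and warehouse name are placeholder assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/catalog",  # assumed local endpoint
        "warehouse": "demo",                     # assumed warehouse name
    },
)
print(catalog.list_namespaces())
```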
Personal Data Engineering Projects
Data API Framework for AI Agents and Data Apps
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥
Enterprise-grade, production-hardened, serverless data lake on AWS
Real-time big data / IoT machine learning (model training and inference) with HiveMQ (MQTT), TensorFlow IO, and Apache Kafka; no additional data store such as S3, HDFS, or Spark required.
GigAPI is a time-series lakehouse for real-time data and sub-second queries, powered by a DuckDB OLAP + Parquet query engine and a compactor with cloud-native storage. A drop-in FDAP alternative ⭐
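The underlying pattern, querying Parquet files directly with DuckDB, can be sketched independently of GigAPI itself. This is plain DuckDB, not GigAPI's own API, and the file path and column names are placeholders:

```python
# Minimal sketch of the DuckDB-over-Parquet pattern such engines build on.
# The glob path and the "ts" column are placeholder assumptions.
import duckdb

result = duckdb.sql("""
    SELECT date_trunc('minute', ts) AS minute, count(*) AS events
    FROM read_parquet('data/*.parquet')   -- placeholder path
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(result[:5])
```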
Amazon S3 Find and Forget is a solution for handling data erasure requests in data lakes stored on Amazon S3, for example pursuant to the European General Data Protection Regulation (GDPR).
Resources for video demonstrations and blog posts related to DataOps on AWS
Data Forge: a modern data stack playground for practicing data flows and best practices, not just tools. Spark, Trino, Kafka, Iceberg, ClickHouse, Airflow, MinIO, and Superset, all wired together locally with Docker Compose.
Cloudflare R2 bucket file uploader with multipart upload enabled. Tested with files up to 10 GB in size.
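R2 exposes an S3-compatible API, so the multipart behaviour can be sketched with boto3. The endpoint form, credentials, bucket name, and thresholds below are placeholder assumptions, not the linked project's code:

```python
# Minimal sketch: multipart upload to a Cloudflare R2 bucket via its
# S3-compatible API using boto3. All concrete values are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # assumed R2 endpoint form
    aws_access_key_id="R2_ACCESS_KEY_ID",
    aws_secret_access_key="R2_SECRET_ACCESS_KEY",
)

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
)
s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)
```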
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Apache Spark 3 - Structured Streaming Course Material
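For a flavor of what Structured Streaming code looks like, here is a minimal, self-contained PySpark example using the built-in rate source and console sink; it is an illustrative sketch, not taken from the course material itself:

```python
# Minimal Structured Streaming sketch: read the built-in "rate" source and
# print micro-batches to the console.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination(30)  # run for about 30 seconds
spark.stop()
```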
Apache Spark Course Material
Reference Architectures for Data Lakes on AWS
Sample data lakehouse deployed in Docker containers using Apache Iceberg, MinIO, Trino, and a Hive Metastore. Can be used for local testing.
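Once such a stack is up, Trino is the usual query entry point. A minimal sketch with the trino Python client, where the host, port, user, catalog, and schema are assumptions about a default local setup rather than values documented by the project:

```python
# Minimal sketch: querying a local lakehouse stack through Trino's Python
# client. Connection parameters are placeholder assumptions.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="admin",
    catalog="iceberg",
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```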
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
The dbt of ML: Aligned describes data dependencies in ML systems and reduces technical data debt.
Udacity Data Engineering Nanodegree Program