There are 19 repositories under the data-lake topic.
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Personal Data Engineering Projects
Data API Framework for AI Agents and Data Apps
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
Lakekeeper is an Apache-Licensed, secure, fast and easy-to-use Apache Iceberg REST Catalog written in Rust.
Enterprise-grade, production-hardened, serverless data lake on AWS
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Resources for video demonstrations and blog posts related to DataOps on AWS
Cloudflare R2 bucket File Uploader with multipart upload enabled. Tested with files up to 10 GB in size.
Apache Spark 3 - Structured Streaming Course Material
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Git-based Version Control File System for joint management of code, data, models and their relationships.
Apache Spark Course Material
Reference Architectures for Data Lakes on AWS
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt
Udacity Data Engineering Nanodegree Program
Data Engineer with Python lecture notes from #datacamp.
rtdl makes it easy to build and maintain a real-time data lake