There are 19 repositories under the data-lake topic.
A few projects related to Data Engineering, including Data Modeling, cloud infrastructure setup, Data Warehousing, and Data Lake development.
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Personal Data Engineering Projects
Data API Framework for AI Agents and Data Apps
Enterprise-grade, production-hardened, serverless data lake on AWS
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Resources for video demonstrations and blog posts related to DataOps on AWS
A Rust implementation of the Iceberg REST Catalog specification.
Apache Spark 3 - Structured Streaming Course Material
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Cloudflare R2 bucket file uploader with multipart upload enabled. Tested with files up to 10 GB in size.
Apache Spark Course Material
Reference Architectures for Datalakes on AWS
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Udacity Data Engineering Nanodegree Program
Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing Reddit comments in real time. 100-200 level tutorial.
rtdl makes it easy to build and maintain a real-time data lake