The following repositories fall under the data-lake topic.
Apache Kyuubi is a distributed, multi-tenant gateway that provides serverless SQL on data warehouses and lakehouses.
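Because Kyuubi exposes a HiveServer2-compatible endpoint, a generic Hive client can submit SQL to it. A minimal sketch with PyHive follows; the host, user, and table name are placeholders, and 10009 is assumed as the default frontend port.

```python
# Minimal sketch: querying a Kyuubi gateway through its HiveServer2-compatible
# endpoint with PyHive. Host, username and the table name are placeholder
# assumptions for illustration, not values from the project above.
from pyhive import hive

conn = hive.Connection(host="kyuubi.example.com", port=10009, username="analyst")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM lakehouse_db.events")  # hypothetical table
print(cursor.fetchall())
cursor.close()
conn.close()
```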
A few data engineering projects covering data modeling, cloud infrastructure setup, data warehousing, and data lake development.
BitSail is a distributed, high-performance data integration engine that supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of records every day.
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Apache Amoro (incubating) is a lakehouse management system built on open data lake formats.
Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST catalog written in Rust.
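Since Lakekeeper implements the Iceberg REST Catalog protocol, any REST-capable Iceberg client should be able to talk to it. A minimal sketch with PyIceberg, where the endpoint URI, warehouse name, and port are placeholder assumptions about a local deployment:

```python
# Minimal sketch: connecting to an Iceberg REST catalog (such as Lakekeeper)
# with PyIceberg. URI and warehouse name are placeholder assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/catalog",  # assumed local endpoint
        "warehouse": "demo",                     # assumed warehouse name
    },
)
print(catalog.list_namespaces())
```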
Personal Data Engineering Projects
Data API Framework for AI Agents and Data Apps
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥
Enterprise-grade, production-hardened, serverless data lake on AWS
Real-time big data / IoT machine learning (model training and inference) with HiveMQ (MQTT), TensorFlow IO, and Apache Kafka; no additional data store such as S3, HDFS, or Spark required.
GigAPI is a time-series lakehouse for real-time data and sub-second queries, powered by a DuckDB OLAP + Parquet query engine and a compactor with cloud-native storage. A drop-in FDAP alternative ⭐
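The underlying pattern, querying Parquet files directly with DuckDB, can be sketched independently of GigAPI itself. This is plain DuckDB, not GigAPI's own API, and the file path and column names are placeholders:

```python
# Minimal sketch of the DuckDB-over-Parquet pattern such engines build on.
# The glob path and the "ts" column are placeholder assumptions.
import duckdb

result = duckdb.sql("""
    SELECT date_trunc('minute', ts) AS minute, count(*) AS events
    FROM read_parquet('data/*.parquet')   -- placeholder path
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(result[:5])
```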
Amazon S3 Find and Forget is a solution for handling data erasure requests in data lakes stored on Amazon S3, for example pursuant to the European General Data Protection Regulation (GDPR).
Resources for video demonstrations and blog posts related to DataOps on AWS
Data Forge: a modern data stack playground for practicing data flows and best practices, not just tools. Spark, Trino, Kafka, Iceberg, ClickHouse, Airflow, MinIO, and Superset, all wired together locally with Docker Compose.
Cloudflare R2 bucket file uploader with multipart upload enabled. Tested with files up to 10 GB in size.
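R2 exposes an S3-compatible API, so the multipart behaviour can be sketched with boto3. The endpoint form, credentials, bucket name, and thresholds below are placeholder assumptions, not the linked project's code:

```python
# Minimal sketch: multipart upload to a Cloudflare R2 bucket via its
# S3-compatible API using boto3. All concrete values are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # assumed R2 endpoint form
    aws_access_key_id="R2_ACCESS_KEY_ID",
    aws_secret_access_key="R2_SECRET_ACCESS_KEY",
)

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
)
s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)
```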
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Apache Spark 3 - Structured Streaming Course Material
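For a flavor of what Structured Streaming code looks like, here is a minimal, self-contained PySpark example using the built-in rate source and console sink; it is an illustrative sketch, not taken from the course material itself:

```python
# Minimal Structured Streaming sketch: read the built-in "rate" source and
# print micro-batches to the console.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination(30)  # run for about 30 seconds
spark.stop()
```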
Apache Spark Course Material
Reference Architectures for Data Lakes on AWS
Sample data lakehouse deployed in Docker containers using Apache Iceberg, MinIO, Trino, and a Hive Metastore. Can be used for local testing.
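Once such a stack is up, Trino is the usual query entry point. A minimal sketch with the trino Python client, where the host, port, user, catalog, and schema are assumptions about a default local setup rather than values documented by the project:

```python
# Minimal sketch: querying a local lakehouse stack through Trino's Python
# client. Connection parameters are placeholder assumptions.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="admin",
    catalog="iceberg",
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```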
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
The dbt of ML: Aligned describes data dependencies in ML systems and reduces technical data debt.
Udacity Data Engineering Nanodegree Program