There are 23 repositories under the datalake topic.
Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
LakeSoul is an end-to-end, real-time, cloud-native Lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.
The LeoFS Storage System
A collection of Apache Hudi-related resources
A free-to-use dbt package for creating and loading Data Vault 2.0-compliant data warehouses (powered by dbt, an open source data engineering tool and registered trademark of dbt Labs)
The world's most powerful data catalog service, providing a high-performance, geo-distributed, and federated metadata lake.
The Internals of Delta Lake
A Data Platform built for AWS, powered by Kubernetes.
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development capability.
Apache Spark Course Material
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Apache Doris Website
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, plus SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
A curated list of open source tools used in analytical stacks and the data engineering ecosystem
Idiomatic Python SDK for Cortex™ Data Lake.
Apache Spark 3 - Structured Streaming Course Material
Terraform script to deploy almost all Azure Data Services
Threat Detection and Visualization
Apiary provides modules which can be combined to create a federated cloud data lake