There are 23 repositories under the datalake topic.
Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
LakeSoul is an end-to-end, real-time, cloud-native Lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.
The LeoFS Storage System
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
A collection of Apache Hudi resources
A free-to-use dbt package for creating and loading Data Vault 2.0 compliant data warehouses (powered by dbt, an open source data engineering tool and registered trademark of dbt Labs)
The Internals of Delta Lake
A Data Platform built for AWS, powered by Kubernetes.
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Streaming application development and management system, based on Linkis and DSS, with a workflow-like graphical drag-and-drop development capability planned.
A curated list of open source tools used in analytical stacks and data engineering ecosystem
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Apache Spark Course Material
Apache Doris Website
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Python idiomatic SDK for Cortex™ Data Lake.
Apache Spark 3 - Structured Streaming Course Material
Terraform script to deploy almost all Azure Data Services
Apiary provides modules which can be combined to create a federated cloud data lake
Threat Detection and Visualization