datalake

There are 23 repositories under the datalake topic.

Sinaptik-AI / pandas-ai
Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
ai csv data data-analysis data-science database datalake gpt-3 gpt-4 llm pandas sql
Language:Python 12130
trinodb / trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino
Language:Java 9946
StarRocks / starrocks
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Language:Java 8374
activeloopai / deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
datasets deep-learning machine-learning data-science pytorch tensorflow data-version-control python ai ml mlops computer-vision cv image-processing datalake langchain llm large-language-models vector-database vector-search
Language:Python 7908
apache / hudi
Upserts, Deletes And Incremental Processing on Big Data.
apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing
Language:Java 5234
paradedb / paradedb
Postgres for Search and Analytics
aggregations analytics big-data bm25 database datalake elasticsearch full-text-search htap hybrid-search iceberg lakehouse-platform mpp object-storage olap postgresql real-time-analytics similarity-search sparse-vector sql
Language:Rust 4891
treeverse / lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Language:Go 4235
DataLinkDC / dinky
Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
datalake datawarehouse flink flinkcdc flinksql olap real-time-computing-platform sql
Language:Java 2964
lakesoul-io / LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
lakesoul datalake lakehouse spark flink streaming big-data postgresql rust sql huggingface python pytorch arrow datafusion vectorized velox
Language:Java 2342
leo-project / leofs
The LeoFS Storage System
leofs erlang distributed-storage distributed-file-system s3-storage s3 nfs-server nfs datalake
Language:Erlang 1543
zinggAI / zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
fuzzymatch fuzzy-matching deduplication dedupe masterdata dataengineering data-transformation analytics-engineering entity-resolution identity-resolution data-transformations data-science spark ml etl dataquality identity modern-data-stack analytics datalake
Language:Java 922
apache / amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
bigdata datalake lakehouse
Language:Java 766
apache / gravitino
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
ai-catalog data-catalog datalake federated-query lakehouse metadata metalake model-catalog opendatacatalog skycomputing stratosphere
Language:Java 745
leesf / hudi-resources
A collection of Apache Hudi related resources
apache apachehudi bigdata data-integration datalake hudi hudi-resources incremental-processing stream-processing
527
Datavault-UK / automate-dv
A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
data-vault dataengineering datalake datavault datavault20 datawarehouse datawarehousing dbt elt etl metadata snowflake sql
482
cuebook / cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg delta lakehouse datalake data-lake elt etl data-engineering data-integration data-ingestion apache-spark spark-sql upsert incremental-updates data-transfer pipelines data-pipeline zeppelin-notebook sql
Language:JavaScript 283
linkedin / openhouse
Open Control Plane for Tables in Data Lakehouse
big-data catalog datalake datalakehouse declarative iceberg management tables
Language:Java 274
japila-books / delta-lake-internals
The Internals of Delta Lake
book books datalake delta-lake deltalake internals
179
awslabs / aws-orbit-workbench
A Data Platform built for AWS, powered by Kubernetes.
analytics aws data-analysis dataengineering datalake eks eks-cluster gpu jupyter jupyterhub kubernetes mach orbit-workbench redshift workbench
Language:Python 128
UncoderIO / Uncoder_IO
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
datalake edr roota siem sigma threathunting translation xdr uncoder uncoderio
Language:Python 114
UncoderIO / Roota
Roota is a public-domain language of threat detection and response that combines native queries from a SIEM, EDR, XDR, or Data Lake with standardized metadata and threat intelligence to enable automated translation into other languages
datalake edr roota siem xdr rootalanguage
109
izhangzhihao / Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
flink data-warehouse real-time-data-warehouse data-warehousing flink-sql debezium kafka elasticsearch delta-lake cdc change-data-capture hudi hoodie iceberg sql datalake delta deltalake spark spark-sql
Language:Dockerfile 101
WeBankFinTech / Streamis
Streaming application development and management system, based on Linkis and DSS, planning to provide the workflow-like graphical drag-and-drop development capability.
flink linkis dataspherestudio wedatasphere streamis streaming hudi iceberg datalake warehouse kafka deltalake
Language:Java 99
pracdata / awesome-open-source-data-engineering
A curated list of open source tools used in analytical stacks and data engineering ecosystem
awesome awesome-list data-analytics data-engineering data-platform database self-hosted mlops data-storage data data-integration data-lakehouse datalake lakehouse workflow-engine analytics data-warehouse observability data-pipeline etl
89
martandsingh / ApacheSpark
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
apachespark data-analysis data-engineering database databricks datalake deltalake etl-pipeline hadoop hive spark spark-sql spark-streaming timetravel etl pyspark sql
Language:Python 88
LearningJournal / SparkProgrammingInScala
Apache Spark Course Material
apache-spark spark spark-sql spark-scala big-data bigdata datalake data-lake scala
Language:Scala 83
apache / doris-website
Apache Doris Website
doris analytics apache big-data data-warehousing database datalake dbms distributed-system hadoop hive hudi iceberg mpp olap ssb tpch vectorized
Language:TypeScript 76
GitDataAI / jiaozifs
A Git-like version control file system for data lineage & data collaboration.
data-collaboration data-versioning aiops data-mesh data-product dataops digital-twins enterprise-datahub mlops federated-learning data-lake-management data-version-control data-lake datalake git-for-data git-filesystem jiaozifs version-controlled-filesystem data-lineage
Language:Go 68
vim89 / datapipelines-essentials-python
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
apache-spark spark spark-sql python python3 pyspark etl etl-pipeline etl-framework etl-components xml xml-parsing datalake big-data hadoop hadoop-mapreduce hadoop-hdfs data-pipeline
Language:Python 53
fuslab / anyscale
anyscale roadmap
datalake fusiondb rdbms spark
49
hifxit / dataligo
A library to accelerate ML and ETL pipelines by connecting all data sources
database datalake datawarehouse etl-pipeline ml-pipeline nosql python
Language:Python 47
PaloAltoNetworks / pan-cortex-data-lake-python
Python idiomatic SDK for Cortex™ Data Lake.
pancloud paloaltonetworks applicationframework python sdk api rest-api panw pan logging-service event-service directory-sync-service directory-sync paloalto logging directory event cortex datalake data
Language:Python 43
LearningJournal / Spark-Streaming-In-Scala
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming
Language:Scala 41
rlevchenko / terraform-azure-data
Terraform script to deploy almost all Azure Data Services
servicebus datalake datafactory eventhub databricks azurefunctions azuredataexplorer kusto azureanalysisserver eventgrid azure-data-factory azure-resources data-lake-storage azure lrs
Language:HCL 37
ExpediaGroup / apiary
Apiary provides modules which can be combined to create a federated cloud data lake
hive datalake aws hive-metastore
35
DataTech-Solutions / Threat-Detection-and-Visualization
Threat Detection and Visualization
api datafactory datalake defender deltalake keyvault parquet-files postman powerbi rolebasedpermissions sccm serverlesssqlpool servicenow siem sql tenablesc dedicatedsqlpool active-directory azuremlstudio machine-learning
Language:TSQL 33