There are 77 repositories under the data-pipeline topic.
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Privacy- and security-focused Segment alternative, in Golang and React
A list of useful resources to learn Data Engineering from scratch
Memphis.dev is a highly scalable and effortless data streaming platform
ingestr is a CLI tool to seamlessly copy data between any databases with a single command.
A powerful, portable, local-first workflow engine for managing complex jobs without pain. Single binary with Web UI. 100% open source. No vendor lock-in. It natively supports running containers and executing commands over SSH. Offline or air-gapped environment ready.
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
A lightweight stream processing library for Go
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
🔥🔥🔥 Open source Reverse ETL - alternative to Hightouch and Census.
Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.
Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Example end-to-end data engineering project.
Fastest open-source tool for replicating Databases to Data Lake in Open Table Formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB, MySQL and Oracle.
Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Practical Data Engineering: A Hands-On Real-Estate Project Guide
A list of resources about Apache Kafka
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Code for "Efficient Data Processing in Spark" Course