data-pipeline

There are 77 repositories under the data-pipeline topic.

apache / shardingsphere
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
database distributed-database distributed-sql-database sql shard database-cluster mysql postgresql encrypt bigdata data-encryption data-pipeline database-middleware distributed-transaction read-write-splitting database-gateway
Language:Java 20521
airbytehq / airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake
Language:Python 19979
debezium / debezium
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
apache-kafka cdc change-data-capture data-pipeline database debezium event-streaming hacktoberfest kafka kafka-connect kafka-producer
Language:Java 12044
snowplow / snowplow
The leader in Customer Data Infrastructure
analytics data data-collection data-pipeline marketing-analytics product-analytics snowplow snowplow-events snowplow-pipeline
Language:Scala 6964
apache / flink-cdc
Flink CDC is a streaming data integration tool
change-data-capture cdc batch data-integration data-pipeline distributed elt etl flink kafka mysql paimon postgresql real-time schema-evolution
Language:Java 6259
modelscope / data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
data-analysis data-science large-language-models llm data-visualization llms instruction-tuning pre-training multi-modal synthetic-data data data-pipeline data-processing foundation-models
Language:Python 5264
rudderlabs / rudder-server
Privacy and Security focused Segment-alternative, in Golang and React
bigquery cdp customer-data customer-data-lake customer-data-pipeline customer-data-platform data-engineering data-integration data-pipeline data-synchronization data-warehouse elt etl event-streaming privacy redshift segment-alternative snowflake warehouse-management warehouse-native
Language:Go 4314
adilkhash / Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
distributed-systems data-engineering data-pipeline cloud-providers scala
3907
superstreamlabs / memphis
Memphis.dev is a highly scalable and effortless data streaming platform
data data-stream-processing data-streaming kubernetes messaging-queue data-engineering data-pipeline golang enrichment message-broker message-bus message-queue microservices schema-registry
Language:Go 3407
bruin-data / ingestr
ingestr is a CLI tool to copy data between any databases with a single command seamlessly.
bigquery copy-database data-ingestion data-integration data-pipeline duckdb ingestion-pipeline mssql postgresql snowflake
Language:Python 3327
dagu-org / dagu
A powerful, portable, local-first workflow engine for managing complex jobs without pain. Single binary with Web UI. 100% open source. No vendor lock-in. It natively supports running containers and executing commands over SSH. Offline or air-gapped environment ready.
cron task-scheduler continuous-delivery workflow-engine workflow-scheduler workflow-orchestration devops data-pipeline job-scheduler task-automation agent-workflow ai-workflow dag directed-acyclic-graph workflow-management
Language:Go 2772
whylabs / whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
ai-pipelines approximate-statistics statistical-properties data-quality calculate-statistics python logging mlops dataops ml-pipelines data-pipeline dataset machine-learning data-science analytics constraints data-constraints model-performance
Language:Jupyter Notebook 2764
elementary-data / elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
data-lineage data-governance data-warehouse snowflake bigquery data-analysis data-pipelines data-pipeline lineage data-reliability data-observability dataops dbt dbt-packages analytics-engineer dbt-artifacts redshift
Language:HTML 2181
reugn / go-streams
A lightweight stream processing library for Go
aerospike data-pipeline data-stream etl kafka kafka-streams low-code nats-streaming pipeline pulsar redis stream-processing stream-processor streaming-api streaming-data streams throttling websocket windowing workflow
Language:Go 2120
pydoit / doit
CLI task management & automation tool
build-automation build-system build-tool cli data-pipeline data-science hacktoberfest python task-runner workflow workflow-automation workflow-management
Language:Python 1983
bytedance / bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time
Language:Java 1675
Multiwoven / multiwoven
🔥🔥🔥 Open source Reverse ETL - alternative to hightouch and census.
data-engineering reverse-etl data-pipeline data-activation etl react ruby self-hosted open-source dbt bigquery data-warehouse databricks postresql redshift snowflake typescript hacktoberfest cdp customer-data-platform
Language:Ruby 1631
superlinked / superlinked
Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.
embeddings etl vector-search data-pipeline deep-learning information-retrieval llm ml mlops natural-language-processing nlp python retrieval retrieval-augmented-generation semantic-search vectorization vector-database
Language:Jupyter Notebook 1425
GoogleCloudPlatform / data-science-on-gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
data-analysis data-visualization cloud-computing machine-learning data-pipeline data-processing data-science data-engineering
Language:Jupyter Notebook 1399
damklis / DataEngineeringProject
Example end to end data engineering project.
airflow big-data data-engineering data-pipeline debezium django-rest-framework elasticsearch hacktoberfest kafka kafka-connect minio mongodb python redis s3 scraping
Language:Python 1347
datazip-inc / olake
Fastest open-source tool for replicating Databases to Data Lake in Open Table Formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB, MySQL and Oracle.
cdc change-data-capture data-pipeline database elt lakehouse replication apache-iceberg parquet s3 hacktoberfest
Language:Go 1174
spotify / klio
Smarter data pipelines for audio.
audio-processing data-pipeline signal-processing media-processing
Language:Python 862
AgnostiqHQ / covalent
Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
covalent data-pipeline data-science deep-learning hacktoberfest hpc hpc-applications machine-learning machinelearning machinelearning-python orchestration parallelization pipelines python quantum quantum-computing quantum-machine-learning workflow workflow-automation workflow-management
Language:Python 844
apache / seatunnel-web
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
apache data-integration data-pipeline etl-framework high-performance offline real-time seatunnel sql-engine
Language:Java 736
ssp-data / practical-data-engineering
Practical Data Engineering: A Hands-On Real-Estate Project Guide
dagster data-engineering data-pipeline
Language:Jupyter Notebook 712
infoslack / awesome-kafka
A list about Apache Kafka
kafka streaming-data data-pipeline stream-processing apache-kafka apache-spark kafka-streams infrastructure data-processing
583
ConduitIO / conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
conduit data-engineering data-integration data-pipeline data-stream etl go kafka kafkaconnect
Language:Go 564
InfuseAI / piperider
Code review for data in dbt
data-pipeline data-profiling data-quality data-science data-exploration eda exploratory-data-analysis data-testing python data-observability data-profiler data-reliability data-visualization dbt dbt-metrics code-review reporting pull-requests continuous-integration
Language:Python 492
sparkfish / augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
data-augmentation crappification deep-neural-networks training-data machine-learning data-pipeline image-processing augmentation-pipeline synthetic-data synthetic-dataset-generation computer-vision
Language:Python 473
1kbgz / tributary
Streaming reactive and dataflow graphs in Python
python python3 stream data-pipeline streaming asynchronous reactive-data-streams python-data-streams kafka websockets lazy-evaluation
Language:Python 458
pracdata / awesome-open-source-data-engineering
A curated list of open source tools used in analytics platforms and data engineering ecosystem
awesome awesome-list data-analytics data-engineering data-platform database self-hosted mlops data-storage data data-integration data-lakehouse datalake lakehouse workflow-engine analytics data-warehouse observability data-pipeline etl
389
msamogh / nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
pytorch data-processing data-preprocessing data-pipeline data-cleaning preprocessing machine-learning torch
Language:Python 378
dataflint / spark
Drop-in replacement for Apache Spark UI
apache-spark big-data data-pipeline data-pipelines databricks dataproc emr etl observability optimization spark-operator
Language:TypeScript 340
josephmachado / efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
apache-spark data-engineering data-pipeline minio pyspark pyspark-notebook
Language:Python 339
elbwalker / walkerOS
Open-source data collection and tag management
data-collection privacy-by-design behavioral-data component-driven event-tracking web-analytics tagging tag-manager
Language:TypeScript 310
cuebook / cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg delta lakehouse datalake data-lake elt etl data-engineering data-integration data-ingestion apache-spark spark-sql upsert incremental-updates data-transfer pipelines data-pipeline zeppelin-notebook sql
Language:JavaScript 288