There are 48 repositories under the etl-pipeline topic.
Streaming data platform. Real-time stream processing, low-latency serving, and Iceberg table management.
A no-code LLM platform for launching APIs and ETL pipelines that structure unstructured documents.
Make stream processing easier! An easy-to-use streaming application development framework and operations platform.
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage, tracing, and metadata. It runs and scales everywhere Python does.
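To illustrate Hamilton's style, here is a minimal sketch: each function is a node in the dataflow, and parameter names wire the nodes together. The column names, file path, and the `__main__` module trick are illustrative assumptions, not code from the project.

```python
# Minimal Hamilton-style dataflow sketch; columns and paths are hypothetical.
import pandas as pd
import __main__
from hamilton import driver

def raw_orders(orders_path: str) -> pd.DataFrame:
    """Node: load the raw orders table from CSV."""
    return pd.read_csv(orders_path)

def order_total(raw_orders: pd.DataFrame) -> pd.Series:
    """Node: per-order total; depends on raw_orders via its parameter name."""
    return raw_orders["quantity"] * raw_orders["unit_price"]

if __name__ == "__main__":
    # Build the DAG from the functions defined in this module and run it.
    dr = driver.Builder().with_modules(__main__).build()
    print(dr.execute(["order_total"], inputs={"orders_path": "orders.csv"}))
```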
Implementing best practices for PySpark ETL jobs and applications.
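One best practice such projects typically emphasize is keeping extract, transform, and load as separate, unit-testable functions. A minimal sketch, assuming hypothetical S3 paths and column names:

```python
# Skeleton PySpark ETL job with testable stages; paths/columns are made up.
from pyspark.sql import SparkSession, functions as F

def extract(spark: SparkSession, path: str):
    return spark.read.json(path)

def transform(df):
    # Pure transformation: easy to unit-test on a tiny in-memory DataFrame.
    return (df.filter(F.col("status") == "active")
              .withColumn("signup_date", F.to_date("signup_ts")))

def load(df, path: str):
    df.write.mode("overwrite").parquet(path)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("example_etl").getOrCreate()
    load(transform(extract(spark, "s3://bucket/raw/users/")), "s3://bucket/clean/users/")
    spark.stop()
```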
A few data engineering projects, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!
A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
A high-performance data processing system in Clojure.
Transform data into AI-ready context. Deploy knowledge graphs, private LLMs, and intelligent agents with complete data sovereignty. From data silos to actionable AI insights.
A blazingly fast general-purpose blockchain analytics engine specialized in systematic MEV detection.
Integrate LLMs into any pipeline: fit/predict pattern, JSON-driven flows, and built-in concurrency support.
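The fit/predict pattern mentioned here maps naturally onto a scikit-learn-style transformer. The sketch below is a hypothetical illustration of that shape, not this library's API; `call_llm` is a stand-in for a real client.

```python
# Hypothetical fit/transform wrapper around an LLM call, so it can slot into
# an existing sklearn Pipeline; call_llm is a placeholder, not a real client.
from sklearn.base import BaseEstimator, TransformerMixin

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in a real LLM client call here")

class LLMExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, prompt_template: str = "Extract the vendor name: {text}"):
        self.prompt_template = prompt_template

    def fit(self, X, y=None):
        return self  # nothing to learn; kept for Pipeline compatibility

    def transform(self, X):
        # One LLM call per record; a real library would run these concurrently.
        return [call_llm(self.prompt_template.format(text=t)) for t in X]
```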
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
The Supabase of the AI era. A modular, open-source backend for building AI-native software, designed for knowledge, not static data.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
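The orchestration layer of such a stack is usually an Airflow DAG chaining ingestion, processing, and storage. A hedged sketch with stub task bodies (the real project's tasks would publish to Kafka, run Spark, and write to Cassandra):

```python
# Hypothetical Airflow DAG sketching an ingest -> process -> store chain;
# the task bodies are stubs, not the repository's actual code.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    ...  # e.g. publish source records to a Kafka topic

def process():
    ...  # e.g. trigger a Spark job that consumes the topic

def store():
    ...  # e.g. write the processed rows to Cassandra

with DAG(dag_id="user_data_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_process = PythonOperator(task_id="process", python_callable=process)
    t_store = PythonOperator(task_id="store", python_callable=store)
    t_ingest >> t_process >> t_store
```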
concurrent & fluent interface for (async) iterables
Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)
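The "automatic schema management" half of that idea reduces to inferring column types from incoming rows and emitting DDL before the bulk load. A deliberately simplified sketch (real services also handle type widening, nested data, and migrations):

```python
# Toy schema inference: map Python types seen in sample rows to SQL types.
PY_TO_SQL = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}

def infer_schema(rows):
    schema = {}
    for row in rows:
        for col, val in row.items():
            if val is not None:
                schema.setdefault(col, PY_TO_SQL.get(type(val), "TEXT"))
    return schema

def create_table_ddl(table, schema):
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in schema.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"

# Prints: CREATE TABLE IF NOT EXISTS events (id BIGINT, ts TEXT, ok BOOLEAN)
print(create_table_ddl("events", infer_schema([{"id": 1, "ts": "2024-01-01", "ok": True}])))
```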
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
A simple Spark-powered ETL framework that just works 🍺
This is a template you can use for your next data engineering portfolio project.
Regular practice on data science, machine learning, deep learning, ML project problem solving, and analytical issues to keep my knowledge sharp. The goal is to help learners with learning resources in the data science field.
The goal of this project is to track Uber Rides and Uber Eats expenses through data engineering processes using technologies such as Apache Airflow, AWS Redshift, and Power BI.
Data pipelines from reusable components.
Build super simple end-to-end data & ETL pipelines for your vector databases and Generative AI applications
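The core loop of a vector-database ETL pipeline is chunk, embed, upsert. A hypothetical sketch where `embed` and the `client` object stand in for a real embedding model and vector store:

```python
# Hypothetical chunk -> embed -> upsert loop; embed() and client are stand-ins.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> list[list[float]]:
    raise NotImplementedError("replace with a real embedding model call")

def upsert_document(client, doc_id: str, text: str) -> None:
    pieces = chunk(text)
    vectors = embed(pieces)
    client.upsert([
        {"id": f"{doc_id}-{i}", "vector": vec, "metadata": {"text": piece}}
        for i, (piece, vec) in enumerate(zip(pieces, vectors))
    ])
```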
Download DIG to run on your laptop or server.
This repository will help you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. At the end of the course we also cover a few case studies.
An app engine for your business. Seamlessly implement business logic with a powerful API. Out-of-the-box CMS, blog, forum, and email functionality. Developer friendly and easily extendable for your next SaaS/XaaS project. Built with Rails 6, Devise, Sidekiq, and PostgreSQL.
An SEO dashboard built from Search Console data, using the Google Search API, a MySQL database, a Node.js REST API (Express.js), and a React dashboard.
Flowfile is a visual ETL tool and Python library combining drag-and-drop workflows with Polars dataframes. Build data pipelines visually, define flows programmatically with a Polars-like API, and export to standalone Python code. Perfect for fast, intuitive data processing from development to production.
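To show the dataframe style such a tool builds on, here is a plain Polars pipeline; this is ordinary Polars, not Flowfile's own API, and the file and columns are invented:

```python
# Plain Polars lazy pipeline; illustrative only, with made-up columns/paths.
import polars as pl

pipeline = (
    pl.scan_csv("sales.csv")                 # lazy scan: nothing is read yet
      .filter(pl.col("amount") > 0)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total_sales"))
)
df = pipeline.collect()  # execution (with query optimization) happens here
```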
Watchmen Platform is a low-code data platform for data pipelines, metadata management, analysis, and quality management.
Prism is the easiest way to develop, orchestrate, and execute data pipelines in Python.