There are 48 repositories under the etl-pipeline topic.
Streaming data platform. Real-time stream processing, low-latency serving, and Iceberg table management.
A no-code LLM platform for launching APIs and ETL pipelines that structure unstructured documents.
Make stream processing easier! An easy-to-use streaming application development framework and operations platform.
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage, tracing, and metadata. It runs and scales everywhere Python does.
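To illustrate Hamilton's style, here is a minimal sketch: each function is a node in the dataflow, and parameter names wire the nodes together. The column names, file path, and the `__main__` module trick are illustrative assumptions, not code from the project.

```python
# Minimal Hamilton-style dataflow sketch; columns and paths are hypothetical.
import pandas as pd
import __main__
from hamilton import driver

def raw_orders(orders_path: str) -> pd.DataFrame:
    """Node: load the raw orders table from CSV."""
    return pd.read_csv(orders_path)

def order_total(raw_orders: pd.DataFrame) -> pd.Series:
    """Node: per-order total; depends on raw_orders via its parameter name."""
    return raw_orders["quantity"] * raw_orders["unit_price"]

if __name__ == "__main__":
    # Build the DAG from the functions defined in this module and run it.
    dr = driver.Builder().with_modules(__main__).build()
    print(dr.execute(["order_total"], inputs={"orders_path": "orders.csv"}))
```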
Implementing best practices for PySpark ETL jobs and applications.
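One best practice such projects typically emphasize is keeping extract, transform, and load as separate, unit-testable functions. A minimal sketch, assuming hypothetical S3 paths and column names:

```python
# Skeleton PySpark ETL job with testable stages; paths/columns are made up.
from pyspark.sql import SparkSession, functions as F

def extract(spark: SparkSession, path: str):
    return spark.read.json(path)

def transform(df):
    # Pure transformation: easy to unit-test on a tiny in-memory DataFrame.
    return (df.filter(F.col("status") == "active")
              .withColumn("signup_date", F.to_date("signup_ts")))

def load(df, path: str):
    df.write.mode("overwrite").parquet(path)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("example_etl").getOrCreate()
    load(transform(extract(spark, "s3://bucket/raw/users/")), "s3://bucket/clean/users/")
    spark.stop()
```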
A few data engineering projects, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!
A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
A high-performance data processing system in Clojure.
Transform data into AI-ready context. Deploy knowledge graphs, private LLMs, and intelligent agents with complete data sovereignty. From data silos to actionable AI insights.
A blazingly fast general-purpose blockchain analytics engine specialized in systematic MEV detection.
Integrate LLMs into any pipeline: fit/predict pattern, JSON-driven flows, and built-in concurrency support.
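The fit/predict pattern mentioned here maps naturally onto a scikit-learn-style transformer. The sketch below is a hypothetical illustration of that shape, not this library's API; `call_llm` is a stand-in for a real client.

```python
# Hypothetical fit/transform wrapper around an LLM call, so it can slot into
# an existing sklearn Pipeline; call_llm is a placeholder, not a real client.
from sklearn.base import BaseEstimator, TransformerMixin

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in a real LLM client call here")

class LLMExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, prompt_template: str = "Extract the vendor name: {text}"):
        self.prompt_template = prompt_template

    def fit(self, X, y=None):
        return self  # nothing to learn; kept for Pipeline compatibility

    def transform(self, X):
        # One LLM call per record; a real library would run these concurrently.
        return [call_llm(self.prompt_template.format(text=t)) for t in X]
```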
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
The Supabase of the AI era. A modular, open-source backend for building AI-native software, designed for knowledge, not static data.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
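The orchestration layer of such a stack is usually an Airflow DAG chaining ingestion, processing, and storage. A hedged sketch with stub task bodies (the real project's tasks would publish to Kafka, run Spark, and write to Cassandra):

```python
# Hypothetical Airflow DAG sketching an ingest -> process -> store chain;
# the task bodies are stubs, not the repository's actual code.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    ...  # e.g. publish source records to a Kafka topic

def process():
    ...  # e.g. trigger a Spark job that consumes the topic

def store():
    ...  # e.g. write the processed rows to Cassandra

with DAG(dag_id="user_data_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_process = PythonOperator(task_id="process", python_callable=process)
    t_store = PythonOperator(task_id="store", python_callable=store)
    t_ingest >> t_process >> t_store
```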
concurrent & fluent interface for (async) iterables
Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)
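The "automatic schema management" half of that idea reduces to inferring column types from incoming rows and emitting DDL before the bulk load. A deliberately simplified sketch (real services also handle type widening, nested data, and migrations):

```python
# Toy schema inference: map Python types seen in sample rows to SQL types.
PY_TO_SQL = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}

def infer_schema(rows):
    schema = {}
    for row in rows:
        for col, val in row.items():
            if val is not None:
                schema.setdefault(col, PY_TO_SQL.get(type(val), "TEXT"))
    return schema

def create_table_ddl(table, schema):
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in schema.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"

# Prints: CREATE TABLE IF NOT EXISTS events (id BIGINT, ts TEXT, ok BOOLEAN)
print(create_table_ddl("events", infer_schema([{"id": 1, "ts": "2024-01-01", "ok": True}])))
```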
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
A simple Spark-powered ETL framework that just works 🍺
This is a template you can use for your next data engineering portfolio project.
Regular practice on data science, machine learning, deep learning, ML project problem solving, and analytical issues to keep my knowledge sharp. The goal is to help learners with learning resources in the data science field.
The goal of this project is to track Uber Rides and Uber Eats expenses through data engineering processes using technologies such as Apache Airflow, AWS Redshift, and Power BI.
Data pipelines from reusable components.
Build super simple end-to-end data & ETL pipelines for your vector databases and Generative AI applications
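The core loop of a vector-database ETL pipeline is chunk, embed, upsert. A hypothetical sketch where `embed` and the `client` object stand in for a real embedding model and vector store:

```python
# Hypothetical chunk -> embed -> upsert loop; embed() and client are stand-ins.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> list[list[float]]:
    raise NotImplementedError("replace with a real embedding model call")

def upsert_document(client, doc_id: str, text: str) -> None:
    pieces = chunk(text)
    vectors = embed(pieces)
    client.upsert([
        {"id": f"{doc_id}-{i}", "vector": vec, "metadata": {"text": piece}}
        for i, (piece, vec) in enumerate(zip(pieces, vectors))
    ])
```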
Download DIG to run on your laptop or server.
This repository will help you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. At the end of the course we also cover a few case studies.
An app engine for your business. Seamlessly implement business logic with a powerful API. Out-of-the-box CMS, blog, forum, and email functionality. Developer friendly and easily extendable for your next SaaS/XaaS project. Built with Rails 6, Devise, Sidekiq, and PostgreSQL.
An SEO dashboard built from Search Console data, using the Google Search API, a MySQL database, a Node.js REST API (Express.js), and a React dashboard.
Flowfile is a visual ETL tool and Python library combining drag-and-drop workflows with Polars dataframes. Build data pipelines visually, define flows programmatically with a Polars-like API, and export to standalone Python code. Perfect for fast, intuitive data processing from development to production.
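To show the dataframe style such a tool builds on, here is a plain Polars pipeline; this is ordinary Polars, not Flowfile's own API, and the file and columns are invented:

```python
# Plain Polars lazy pipeline; illustrative only, with made-up columns/paths.
import polars as pl

pipeline = (
    pl.scan_csv("sales.csv")                 # lazy scan: nothing is read yet
      .filter(pl.col("amount") > 0)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total_sales"))
)
df = pipeline.collect()  # execution (with query optimization) happens here
```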
Watchmen Platform is a low-code data platform for data pipelines, metadata management, analysis, and quality management.
Prism is the easiest way to develop, orchestrate, and execute data pipelines in Python.