There are 56 repositories under the data-engineering-pipeline topic.
A few projects related to Data Engineering, including Data Modeling, cloud infrastructure setup, Data Warehousing, and Data Lake development.
An end-to-end GoodReads data pipeline for building a Data Lake, Data Warehouse, and Analytics Platform.
One framework to develop, deploy and operate data workflows with Python and SQL.
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
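As a minimal sketch of what the stages of such a pipeline might look like (plain Python with hypothetical function names; the described repo runs its transforms on Spark, loads into Redshift, and wires the stages together as Airflow tasks):

```python
# Minimal ETL sketch with hypothetical stage functions. In the described
# pipeline, extract/transform would run on Spark and load would target
# Redshift; here each stage is plain Python so the control flow is clear.

def extract():
    # Stand-in for reading raw records (e.g. from S3).
    return [{"user": "a", "amount": "10"}, {"user": "b", "amount": "5"}]

def transform(rows):
    # Cast types before warehousing.
    return [{"user": r["user"], "amount": int(r["amount"])} for r in rows]

def load(rows, warehouse):
    # Stand-in for a COPY/INSERT into Redshift.
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
```

In Airflow, each of these callables would typically become its own task (e.g. via `PythonOperator`), with the scheduler enforcing the extract → transform → load ordering.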
Project demonstrating how to automate Prefect 2.0 deployments to AWS ECS Fargate
Code examples showing flow deployment to various types of infrastructure
Classwork projects and homework completed for the Udacity Data Engineering Nanodegree
Let your pipelines flow through the Python code in xonsh.
Deploy a Prefect flow to serverless AWS Lambda function
Apache Spark Guide
F1 Data Pipeline
ETL pipeline combined with supervised learning and grid search to classify text messages sent during a disaster event
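The grid-search half of that idea can be sketched in plain Python (the repo would use scikit-learn's `GridSearchCV` over a real text classifier; here the scorer and parameter names are hypothetical stand-ins):

```python
from itertools import product

# Toy grid search: score every hyperparameter combination and keep the
# best. A real pipeline would fit and cross-validate a classifier inside
# cv_score; this scorer is a hypothetical stand-in.

param_grid = {"ngram_max": [1, 2], "min_df": [1, 5]}

def cv_score(params):
    # Hypothetical cross-validation score.
    return 0.8 + 0.05 * params["ngram_max"] - 0.01 * params["min_df"]

best_params, best_score = None, float("-inf")
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = cv_score(params)
    if score > best_score:
        best_params, best_score = params, score
```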
An end-to-end real-time stock market data pipeline built with Python, AWS EC2, Apache Kafka, and Cassandra. Data is processed on AWS EC2 with Apache Kafka and stored in a local Cassandra database.
Learning from multiple companies in Silicon Valley: Netflix, Facebook, Google, and startups
Job application challenge: Data Scientist
Data Engineering pipeline hosted entirely in the AWS ecosystem utilizing DocumentDB as the database
A batch data pipeline that retrieves data from a user purchase table and a movie review table and transforms them into a user behaviour metric table.
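A sketch of that transform in SQL (via Python's built-in sqlite3; the table and column names are hypothetical and the repo's actual schema and engine may differ). Note the pre-aggregated subqueries: joining the raw tables directly would fan out rows and inflate the sums.

```python
import sqlite3

# Join a user purchase table with a movie review table to derive a
# per-user behaviour metric. Each side is aggregated first so the join
# does not multiply rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_purchase (user_id INT, amount REAL);
CREATE TABLE movie_review (user_id INT, is_positive INT);
INSERT INTO user_purchase VALUES (1, 20.0), (1, 30.0), (2, 10.0);
INSERT INTO movie_review VALUES (1, 1), (1, 0), (2, 1);
""")
rows = conn.execute("""
SELECT p.user_id, p.total_spend, r.positive_ratio
FROM (SELECT user_id, SUM(amount) AS total_spend
      FROM user_purchase GROUP BY user_id) p
JOIN (SELECT user_id, AVG(is_positive * 1.0) AS positive_ratio
      FROM movie_review GROUP BY user_id) r
  ON r.user_id = p.user_id
ORDER BY p.user_id
""").fetchall()
```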
Marshmallow serializer integration with pyspark
Data Engineering Project with Hadoop HDFS and Kafka
A data engineering pipeline for digital marketers.
An end-to-end Twitter Data Pipeline that extracts data from Twitter and loads it into AWS S3.
Social media analysis: a scalable, flexibly deployable solution that analyses social media content
Using Great Expectations and Notion's API, this repo aims to provide data quality for our databases in Notion.
Data Engineering is like the backbone of data processing, managing data pipelines, warehouses, and lakes. It's the bridge between raw data and actionable insights, powering businesses with efficient data management and analytics.
A streaming ETL pipeline for Realtime Tweet Collection, Analysis and Reporting
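The shape of such a streaming loop can be sketched with a Python generator (all names here are hypothetical stand-ins; the real pipeline would read from the Twitter API, run a real sentiment model, and push results to a reporting sink):

```python
# Streaming ETL sketch: a generator stands in for the tweet stream, each
# item is analysed as it arrives, and a running report is updated
# incrementally instead of in one batch.

def tweet_stream():
    # Stand-in for a live collector.
    yield {"text": "pipelines are great"}
    yield {"text": "this outage is terrible"}

def analyse(tweet):
    # Toy keyword rule in place of a real sentiment model.
    return "neg" if "terrible" in tweet["text"] else "pos"

report = {"pos": 0, "neg": 0}
for tweet in tweet_stream():
    report[analyse(tweet)] += 1
```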
Project demonstrating how to automate Prefect 2.0 deployments to AWS EKS
Solution for the Ultimate Student Hunt Challenge (1st place).
A Data Engineering project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Metabase, dbt, Polars, and Docker. Data from Kaggle and the YouTube API.
The goal of this project is to analyse the impact of Covid-19 on the aviation industry through data engineering processes using technologies such as Apache Airflow, Apache Spark, Tableau, and a couple of AWS services
Get started with Prefect by scheduling your Prefect flows with GitHub Actions
End-to-end data engineering processes for the NIGERIA Health Facility Registry (HFR). The project leveraged Selenium, Pandas, PySpark, PostgreSQL and Airflow
An end-to-end data pipeline for building a Data Lake and supporting reporting using Apache Spark.
An ETL project: extracts data from an e-commerce transactional database on RDS, transforms it with an AWS Glue job, loads it into a Redshift data warehouse, and connects it to Tableau for BI
Docker powered starter for geospatial analysis of lightning atmospheric data.