There are 9 repositories under aws-emr topic.
Spark 2.0 Python Machine Learning examples
Spark 2.0 Scala Machine Learning examples
An AWS-based solution that uses AWS CloudWatch and a Python-based AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly, so you can focus on writing PySpark code.
Terraform module to create AWS EMR resources 🇺🇦
A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):
Use aws-emr and aws-redshift to analyse the US adult census dataset
Run a Spark job within Amazon EMR
A Grafana-based application to assist Big Data infrastructure optimization initiatives where Spark applications are a dominant cost driver
A collection of sample Airflow workflows for data processing on AWS
A Spark application, written in Python, that finds strongly connected components using the Bi-directional Label Propagation algorithm. This project applied it to a 1.3GB Twitter network dataset on an AWS EMR cluster.
Create a data lake on AWS S3 to store dimensional tables after processing data with Spark on an AWS EMR cluster
A cookiecutter template for working with PySpark on AWS EMR
Data Engineering Project with Terraform, Spark, AWS, Docker, Airflow and other tools
A large-scale data framework that will enable us to store and analyze financial market data and drive future predictions for investment.
Spark 2.0 R/SparkR Machine Learning examples
My AWS Playground
EMR + Hadoop to Redshift ELT workflow that uses the Spark steps API and is orchestrated by Apache Airflow. It ingests disparate datasets centered on 7Gb of I94 arrivals information to produce a simple star schema in Redshift
Data Analysis Exercise over Walmart Stock
Daily incremental-load ETL pipeline for an e-commerce company using AWS Lambda and an AWS EMR cluster, deployed using Apache Airflow in a Docker container.
We build an ETL pipeline using Airflow that downloads data from an AWS S3 bucket, runs a Spark/Spark SQL job on the downloaded data to produce a cleaned-up dataset of orders that missed their delivery deadline, and then uploads the cleaned-up dataset back to the same S3 bucket in a folder primed for higher-level analytics
Lambda to start EMR and run a MapReduce job
Generic Python library for provisioning EMR clusters with YAML config files (Configuration as Code)
MapReduce Analysis on Amazon Food Review Dataset (Big-Data)
ETL pipeline with PySpark on EMR orchestrated with Airflow
Analysed New York City's Yellow taxi data set with Big Data tools such as Hadoop, HBase, Sqoop, MapReduce and AWS Cloud Infrastructure.
This project analyzes the correlation between COVID-19 and the US aviation industry. By studying data on passenger/freight traffic and delays alongside COVID-19 trends, it provides insights into airline and passenger responses. The findings help airlines adapt to the pandemic's impact.
CMPT 732 Project - Joined 3 large-scale databases to analyze the economic impact of Covid-19 on the airline industry. Fetched data via an API and stored it in AWS S3, from which an AWS EMR cluster retrieves it for computation. Queried the results with AWS Athena and visualized them in Tableau using static and dynamic dashboards.
Analysis and monitoring system using AWS... Also the comp4442 project