Repositories under the emr-cluster topic:
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, TensorFlow, Mathematics
Reference Architectures for Data Lakes on AWS
Classwork projects and homework completed through the Udacity Data Engineering Nanodegree
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances, so you can get started quickly and focus on writing PySpark code.
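As a hedged sketch of what such a template might configure, the snippet below builds a `run_job_flow`-style request that keeps the master node On-Demand and puts core workers on Spot capacity via instance fleets. The cluster name, release label, instance types, and capacities are illustrative assumptions, not values from the repository; the dict shape follows the parameters accepted by boto3's `emr.run_job_flow`.

```python
# Sketch of an EMR cluster request mixing On-Demand and Spot capacity via
# instance fleets. All concrete values here are illustrative assumptions.

def build_cluster_request(name="pyspark-template", release="emr-6.9.0"):
    """Build a run_job_flow-style request with a Spot core fleet."""
    return {
        "Name": name,
        "ReleaseLabel": release,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceFleets": [
                {   # master node stays On-Demand for stability
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                },
                {   # core workers use cheaper Spot capacity, with two
                    # instance types so the fleet can fill from either pool
                    "InstanceFleetType": "CORE",
                    "TargetSpotCapacity": 4,
                    "InstanceTypeConfigs": [
                        {"InstanceType": "m5.xlarge"},
                        {"InstanceType": "m5a.xlarge"},
                    ],
                },
            ],
            # terminate the cluster once all steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request()
# Would be submitted with: boto3.client("emr").run_job_flow(**request)
```

Offering multiple instance types per fleet is what lets EMR fall back to another pool when one Spot market is exhausted.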
This is an ETL application on AWS using general open sales and customer data, available here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several .csv files to which we will apply transformations.
Apache Spark TPC-DS benchmark setup with EMR launch setup
A Cassandra Architecture for GDELT Database 🌍
Uses EMR clusters to export DynamoDB tables to S3 and generates import steps
A boilerplate for Spark projects with Docker support for local development and scripts for EMR support.
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it with Spark, and load the transformed data back to S3.
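A minimal sketch of the piece such a pipeline typically hinges on: the EMR step list that an Airflow `EmrAddStepsOperator` submits to run a PySpark script via `command-runner.jar`. The script path and bucket names are placeholder assumptions, not values from the project.

```python
# Hypothetical sketch: build the EMR step an Airflow DAG would submit for
# an S3 -> Spark -> S3 batch job. Paths and names are assumptions.

def make_spark_step(script_s3_uri, extra_args=()):
    """Build one EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": "spark-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_s3_uri,
                *extra_args,
            ],
        },
    }

steps = [make_spark_step(
    "s3://my-bucket/scripts/etl.py",
    extra_args=("--input", "s3://my-bucket/raw/",
                "--output", "s3://my-bucket/processed/"),
)]
# In a DAG this list would feed:
# EmrAddStepsOperator(task_id="add_steps", job_flow_id=..., steps=steps)
```

Passing input/output prefixes as script arguments keeps the DAG in charge of paths, so the same Spark script can serve multiple environments.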
An end-to-end data pipeline for building Data Lake and supporting report using Apache Spark.
A large-scale data framework that will enable us to store and analyze financial market data and drive future predictions for investment.
This project demonstrates the use of Amazon Elastic MapReduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command-line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
ETL data pipeline using AWS services
Generic Python library for provisioning EMR clusters from YAML config files (Configuration as Code)
Collection of code for submitting Spark/Hadoop/Hive/Pig tasks to EMR (AWS Elastic MapReduce) | #DE
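To make the submission pattern concrete, here is a hedged sketch of a helper that formats an `aws emr add-steps` CLI invocation for a Spark task, in the spirit of such a collection. The cluster ID and script URI are placeholder assumptions.

```python
# Sketch: format an `aws emr add-steps` command for a Spark job.
# Cluster id and S3 URI below are placeholder assumptions.
import shlex

def add_steps_command(cluster_id, script_uri, step_name="spark-job"):
    """Return the AWS CLI argv that adds one spark-submit step."""
    # Type=Spark is the CLI shorthand that expands to command-runner.jar
    step = (
        f"Type=Spark,Name={step_name},ActionOnFailure=CONTINUE,"
        f"Args=[--deploy-mode,cluster,{script_uri}]"
    )
    return ["aws", "emr", "add-steps",
            "--cluster-id", cluster_id,
            "--steps", step]

cmd = add_steps_command("j-2AXXXXXXGAPLF", "s3://my-bucket/jobs/wordcount.py")
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argv as a list (rather than one shell string) makes it safe to hand to `subprocess.run` without extra quoting.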
Amazon EMR for Data Science
Single-node EMR 5.25.0 Hadoop Docker image, with Amazon Linux, Hadoop 2.8.5, and Hive 2.3.5
This repository contains a definition of a standard structure for machine learning and data pipeline projects
Orchestrating Cloud ETL Workloads
Event driven EMR via Serverless
This big-data study identifies the highest revenue-generating taxi zones in New York City for the year 2019. Three MapReduce algorithms were developed, and their performance was analyzed on input datasets of different sizes and on EMR clusters of different sizes.
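The map/reduce pattern behind that study can be illustrated with a toy, in-memory example: map each trip record to a `(zone, revenue)` pair, then reduce by summing per zone and ranking. The trip values below are made up for illustration; the real project ran over the 2019 NYC taxi data on EMR.

```python
# Toy illustration of the map/reduce aggregation behind the taxi-zone
# study. Trip data here is fabricated for demonstration only.
from collections import defaultdict

def mapper(trip):
    """Emit (pickup_zone, trip_revenue) for one trip record."""
    zone, fare, tip = trip
    return zone, fare + tip

def reducer(pairs):
    """Sum revenue per zone, as the shuffle/reduce phase would."""
    totals = defaultdict(float)
    for zone, amount in pairs:
        totals[zone] += amount
    return dict(totals)

trips = [("Midtown", 12.5, 2.0), ("JFK", 52.0, 10.0), ("Midtown", 8.0, 1.5)]
revenue = reducer(map(mapper, trips))
top_zone = max(revenue, key=revenue.get)
# top_zone == "JFK" with 62.0 in total revenue
```

On EMR the same mapper and reducer logic would be distributed by Hadoop across nodes, which is what makes cluster size a meaningful variable in the performance comparison.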
Data Engineering Expert Nanodegree - Data Lake on AWS using Spark and S3
Database Schema & ETL pipeline for Song Play Analysis | Bosch AI Talent Accelerator Scholarship Program
Performed business operations using big-data technologies: AWS EMR, AWS RDS (MySQL), Hadoop, Apache Sqoop, Apache HBase, MapReduce
Guide: executing a Python script on AWS EMR for big-data analysis.
👷🌇 Set up and build a big-data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), Terraform to set up the infrastructure, and Airflow integration to automate workflows 🥊
Used a public clickstream dataset from a cosmetics store to extract data and gather insights. Launched an EMR 5.29.0 cluster running Hive and used optimized Hive queries to identify customer behavior and improve sales.
Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.
PlayerUnknown's Battlegrounds (PUBG) is a first-person shooter where the goal is to be the last player standing. You are placed on a giant circular map that shrinks as the game goes on, and you must find weapons, armor, and other supplies in order to kill other players/teams and survive.
Used AWS and PySpark to solve this EDA assignment
Load data from S3, process it into analytics tables using Spark, and load the results back into S3. Deployed this Spark process on a cluster using AWS EMR.