Repositories under the emr-cluster topic:
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, TensorFlow, Mathematics
Reference Architectures for Data Lakes on AWS
Classwork projects and homework completed through the Udacity Data Engineering Nanodegree
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances, so you can get started quickly and focus on writing PySpark code.
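As a hedged sketch of what such a template might configure, the snippet below builds a `run_job_flow`-style request that keeps the master node On-Demand and puts core workers on Spot capacity via instance fleets. The cluster name, release label, instance types, and capacities are illustrative assumptions, not values from the repository; the dict shape follows the parameters accepted by boto3's `emr.run_job_flow`.

```python
# Sketch of an EMR cluster request mixing On-Demand and Spot capacity via
# instance fleets. All concrete values here are illustrative assumptions.

def build_cluster_request(name="pyspark-template", release="emr-6.9.0"):
    """Build a run_job_flow-style request with a Spot core fleet."""
    return {
        "Name": name,
        "ReleaseLabel": release,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceFleets": [
                {   # master node stays On-Demand for stability
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                },
                {   # core workers use cheaper Spot capacity, with two
                    # instance types so the fleet can fill from either pool
                    "InstanceFleetType": "CORE",
                    "TargetSpotCapacity": 4,
                    "InstanceTypeConfigs": [
                        {"InstanceType": "m5.xlarge"},
                        {"InstanceType": "m5a.xlarge"},
                    ],
                },
            ],
            # terminate the cluster once all steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request()
# Would be submitted with: boto3.client("emr").run_job_flow(**request)
```

Offering multiple instance types per fleet is what lets EMR fall back to another pool when one Spot market is exhausted.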
This is an ETL application on AWS using general open sales and customer data, available here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several .csv files to which we will apply transformations.
Apache Spark TPC-DS benchmark setup with EMR launch setup
A Cassandra Architecture for GDELT Database 🌍
Uses EMR clusters to export DynamoDB tables to S3 and generates import steps
A boilerplate for Spark projects with Docker support for local development and scripts for EMR support.
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it with Spark, and load the transformed data back to S3.
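A minimal sketch of the piece such a pipeline typically hinges on: the EMR step list that an Airflow `EmrAddStepsOperator` submits to run a PySpark script via `command-runner.jar`. The script path and bucket names are placeholder assumptions, not values from the project.

```python
# Hypothetical sketch: build the EMR step an Airflow DAG would submit for
# an S3 -> Spark -> S3 batch job. Paths and names are assumptions.

def make_spark_step(script_s3_uri, extra_args=()):
    """Build one EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": "spark-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_s3_uri,
                *extra_args,
            ],
        },
    }

steps = [make_spark_step(
    "s3://my-bucket/scripts/etl.py",
    extra_args=("--input", "s3://my-bucket/raw/",
                "--output", "s3://my-bucket/processed/"),
)]
# In a DAG this list would feed:
# EmrAddStepsOperator(task_id="add_steps", job_flow_id=..., steps=steps)
```

Passing input/output prefixes as script arguments keeps the DAG in charge of paths, so the same Spark script can serve multiple environments.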
An end-to-end data pipeline for building Data Lake and supporting report using Apache Spark.
A large-scale data framework that will enable us to store and analyze financial market data and drive future predictions for investment.
This project demonstrates the use of Amazon Elastic MapReduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command-line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
ETL data pipeline using AWS services
Generic Python library for provisioning EMR clusters from YAML config files (Configuration as Code)
Collection of code for submitting Spark/Hadoop/Hive/Pig tasks to EMR (AWS Elastic MapReduce) | #DE
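To make the submission pattern concrete, here is a hedged sketch of a helper that formats an `aws emr add-steps` CLI invocation for a Spark task, in the spirit of such a collection. The cluster ID and script URI are placeholder assumptions.

```python
# Sketch: format an `aws emr add-steps` command for a Spark job.
# Cluster id and S3 URI below are placeholder assumptions.
import shlex

def add_steps_command(cluster_id, script_uri, step_name="spark-job"):
    """Return the AWS CLI argv that adds one spark-submit step."""
    # Type=Spark is the CLI shorthand that expands to command-runner.jar
    step = (
        f"Type=Spark,Name={step_name},ActionOnFailure=CONTINUE,"
        f"Args=[--deploy-mode,cluster,{script_uri}]"
    )
    return ["aws", "emr", "add-steps",
            "--cluster-id", cluster_id,
            "--steps", step]

cmd = add_steps_command("j-2AXXXXXXGAPLF", "s3://my-bucket/jobs/wordcount.py")
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argv as a list (rather than one shell string) makes it safe to hand to `subprocess.run` without extra quoting.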
Amazon EMR for Data Science
Single-node EMR 5.25.0 Hadoop Docker image, with Amazon Linux, Hadoop 2.8.5, and Hive 2.3.5
This repository contains a definition of a standard structure for machine learning and data pipeline projects
Orchestrating Cloud ETL Workloads
Event driven EMR via Serverless
This big-data study identifies the highest revenue-generating taxi zones in New York City for the year 2019. Three MapReduce algorithms were developed, and their performance was analyzed on input datasets of different sizes and on EMR clusters of different sizes.
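The map/reduce pattern behind that study can be illustrated with a toy, in-memory example: map each trip record to a `(zone, revenue)` pair, then reduce by summing per zone and ranking. The trip values below are made up for illustration; the real project ran over the 2019 NYC taxi data on EMR.

```python
# Toy illustration of the map/reduce aggregation behind the taxi-zone
# study. Trip data here is fabricated for demonstration only.
from collections import defaultdict

def mapper(trip):
    """Emit (pickup_zone, trip_revenue) for one trip record."""
    zone, fare, tip = trip
    return zone, fare + tip

def reducer(pairs):
    """Sum revenue per zone, as the shuffle/reduce phase would."""
    totals = defaultdict(float)
    for zone, amount in pairs:
        totals[zone] += amount
    return dict(totals)

trips = [("Midtown", 12.5, 2.0), ("JFK", 52.0, 10.0), ("Midtown", 8.0, 1.5)]
revenue = reducer(map(mapper, trips))
top_zone = max(revenue, key=revenue.get)
# top_zone == "JFK" with 62.0 in total revenue
```

On EMR the same mapper and reducer logic would be distributed by Hadoop across nodes, which is what makes cluster size a meaningful variable in the performance comparison.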
Data Engineering Expert Nanodegree - Data Lake on AWS using Spark and S3
Database Schema & ETL pipeline for Song Play Analysis | Bosch AI Talent Accelerator Scholarship Program
Performed business operations using big-data technologies: AWS EMR, AWS RDS (MySQL), Hadoop, Apache Sqoop, Apache HBase, MapReduce
Guide: executing a Python script on AWS EMR for big-data analysis.
👷🌇 Set up and build a big-data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), Terraform to set up the infrastructure, and Airflow integration to automate workflows 🥊
Used a public clickstream dataset from a cosmetics store to extract data and gather insights. Launched an EMR 5.29.0 cluster running Hive and used optimized Hive queries to identify customer behavior and improve sales.
Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.
PlayerUnknown's Battlegrounds (PUBG) is a first-person shooter where the goal is to be the last player standing. You are placed on a giant circular map that shrinks as the game goes on, and you must find weapons, armor, and other supplies in order to kill other players/teams and survive.
Used AWS and PySpark to solve this EDA assignment
Load data from S3, process it into analytics tables using Spark, and load the results back into S3. Deployed this Spark process on a cluster using AWS EMR.