There are 3 repositories under the aws-emr-clusters topic.
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics
Terraform module to create AWS EMR resources 🇺🇦
Run a Spark job within Amazon EMR
EMR + Hadoop to Redshift ELT workflow using the Spark steps API and orchestrated by Apache Airflow, which ingests disparate datasets centered on 7 GB of I94 arrivals information to produce a simple star schema in Redshift
Lambda to start EMR and run a MapReduce job
Daily incremental-load ETL pipeline for an e-commerce company using AWS Lambda and an AWS EMR cluster, deployed using Apache Airflow in a Docker container.
Detect tight communities in a social network
Data Engineering Projects including Data Modeling, Data Warehouse, Data Lake Development
Product review analyses on an Amazon dataset using Apache Spark and MongoDB
Load data from the Million Song Dataset into a final dimensional model stored in S3.
Credit defaults cause large losses for banks and other credit lenders, and the success of the banking industry depends on the ability to understand risk. This project uses big data technologies such as MapReduce and HDFS, along with PySpark and AWS, to analyze credit history and predict defaults.
An opinionated framework for running big data jobs
With this app, you can see what programming skills are most in-demand in the current job market.
Example for provisioning AWS EMR service with Terraform
Stand-alone Scala & Java tool to anonymize OOXML Documents (DOCX)
Data Pipeline Analytics Platform is an end-to-end generic Big Data pipeline. It involves the following tech stack: AWS S3, AWS Redshift, AWS EMR Cluster, Apache Spark, Apache Airflow.
PySpark RDD and DataFrame Examples
AWS EMR backed Spark cluster for analyzing Yelp Data
Udacity project: implementing an ETL pipeline to process data with Apache Spark and store it in AWS S3
In this repo, I build a LogisticRegression prediction model with Dask and PySpark and initialize an AWS EMR cluster to run the entire pipeline.
Implemented a random forest machine learning algorithm using PySpark on AWS EMR to classify wines. The model is then deployed in a Docker container.
Define a big data architecture and perform distributed machine learning calculations on an EMR cluster using AWS
Built a data model, data warehouse, and pipeline for extracting, transforming, and loading data into a star-schema-based data model in a Redshift database
ETL pipeline that extracts JSON files from an AWS S3 bucket, transforms them using an AWS EMR Spark cluster, and stores the data in an AWS S3 bucket in Parquet file format.
A CNN is deployed in AWS to extract image features in the context of distributed computing.
MLP for sentiment analysis on movie reviews.
Realtime data pipeline
Predicting customer churn for the music app, Sparkify, using PySpark on AWS EMR clusters
A cloud-based Reddit stock sentiment analyzer that computes the overall sentiment for each stock from a configurable selection of stock subreddits. The architecture uses AWS MSK (Kafka), AWS EMR (PySpark), and AWS Lambda (Python 3) for scalability, and the OpenAI API for sentiment analysis through prompt engineering.
A scalable prototype of an image recognition engine deployed on AWS.
TU Berlin Cloud Computing - correctly implemented assignment4