aiwithqasim / emr-batch-processing

This repo contains the code for Batch Processing using PySpark on AWS EMR.

About Batch Data Pipeline:

The Wikipedia activity data lands in a folder in an S3 bucket. A PySpark job running on the EMR cluster fetches this data from S3, performs filtering and aggregation on it, and pushes the processed data back into S3 under another folder. We then use Athena to query the processed data in S3: we create a table on top of it by providing the relevant schema and query it with ANSI SQL (a sketch of this Athena step follows the services list below).
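
As a rough sketch (not the repo's exact code), the EMR step might look like the PySpark job below. The bucket name, folder prefixes, the aggregation key (page), and the Parquet output format are assumptions for illustration; the filter conditions come from the NOTE in the Dataset section.

```python
# batch_job.py -- a minimal sketch of the pipeline described above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wikipedia-batch-processing").getOrCreate()

# 1. Fetch the raw Wikipedia activity JSON from the input folder in S3
#    (bucket and prefix are placeholders).
raw_df = spark.read.json("s3://<your-bucket>/raw/")

# 2. Filter: keep only non-robot activity from the United States
#    (the two conditions from the Dataset NOTE; column names are assumed).
filtered_df = raw_df.filter(
    (F.col("isRobot") == False) & (F.col("countryName") == "United States")
)

# 3. Aggregate: count activity events per page (an illustrative choice).
agg_df = filtered_df.groupBy("page").agg(F.count("*").alias("event_count"))

# 4. Push the processed data back into S3, under another folder.
agg_df.write.mode("overwrite").parquet("s3://<your-bucket>/processed/")

spark.stop()
```

On EMR this would typically run as a step, e.g. `spark-submit batch_job.py`, with the cluster's instance role granting read/write access to the bucket.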

Full Blog link: Batch Processing using PySpark on AWS EMR

Architecture Diagram:

  • Languages - Python
  • Package - PySpark
  • Services - AWS EMR, AWS S3, AWS Athena
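
For the Athena step described in the pipeline overview, one hedged sketch: create an external table over the processed folder and query it with ANSI SQL via boto3. The table name, the two-column schema, and the S3 paths are assumptions matching the illustrative job above; the same statements can equally be run from the Athena console.

```python
# athena_query.py -- a sketch of the Athena step; names and paths are
# assumptions, not the repo's actual values.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")
RESULTS = "s3://<your-bucket>/athena-results/"

def run(sql: str) -> None:
    """Start a query and poll until Athena reports a terminal state."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": RESULTS},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

# Create a table on top of the processed data with the (assumed) schema.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS processed_activity (
    page        STRING,
    event_count BIGINT
)
STORED AS PARQUET
LOCATION 's3://<your-bucket>/processed/'
""")

# Query the processed data with ANSI SQL.
run("SELECT page, event_count FROM processed_activity "
    "ORDER BY event_count DESC LIMIT 10")
```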

Dataset:

We'll be using the Wikipedia activity logs JSON dataset; each record is a large payload comprising 15+ fields.

NOTE: In the script we take two conditions into consideration: we keep only those payloads where _isRobot_ is False and the user's country is the United States.
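
Expressed in PySpark (the column names isRobot and countryName are assumptions based on this note; the actual field names in the payload may differ), the two conditions map to a single where clause:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-conditions").getOrCreate()
raw_df = spark.read.json("s3://<your-bucket>/raw/")  # assumed input path

# Condition 1: isRobot is False.  Condition 2: the user's country is the
# United States.  Both must hold, hence the single & conjunction.
wanted = raw_df.where(
    (F.col("isRobot") == False) & (F.col("countryName") == "United States")
)
```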

For more such content, please follow:
