Wittline / apache-spark-course

Apache Spark with python

Week 1 - Class 1

Week 1 - Class 2

Week 1 - Class 3

Week 2 - Class 1

Week 2 - Class 2

Week 2 - Class 3

Week 3 - Class 1 - EMR Serverless

Create Roles

  1. Create EMR Notebook Role
    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": "elasticmapreduce.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  • Attach AmazonElasticMapReduceEditorsRole policy
  • Attach AmazonS3FullAccess policy
  2. Create EMR Serverless Execution Role
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "emr-serverless.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadAccessForEMRSamples",
                "Effect": "Allow",
                "Action": [ ... ],
                "Resource": [ ... ]
            },
            {
                "Sid": "FullAccessToOutputBucket",
                "Effect": "Allow",
                "Action": [ ... ],
                "Resource": [ ... ]
            },
            {
                "Sid": "GlueCreateAndReadDataCatalog",
                "Effect": "Allow",
                "Action": [ ... ],
                "Resource": [ ... ]
            }
        ]
    }
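
The two trust policies above differ only in the service principal, so they can be generated by one helper. The following is a minimal sketch, not the course's own script: the function names `trust_policy` and `create_role` are hypothetical, and the optional `create_role` call assumes boto3 is installed and AWS credentials are configured. Note that it defaults to policy version "2012-10-17" for both roles.

```python
import json


def trust_policy(service, version="2012-10-17"):
    """Return a trust policy that lets the given AWS service assume the role."""
    return {
        "Version": version,
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role(role_name, service, iam_client=None):
    """Create an IAM role with the trust policy above (needs AWS credentials)."""
    if iam_client is None:
        import boto3  # imported lazily so trust_policy() works without boto3

        iam_client = boto3.client("iam")
    return iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy(service)),
    )


if __name__ == "__main__":
    # Print the trust policy for the EMR Serverless execution role.
    print(json.dumps(trust_policy("emr-serverless.amazonaws.com"), indent=4))
```

For example, `create_role("emr-notebook-role", "elasticmapreduce.amazonaws.com")` would create the notebook role; the managed policies listed above still have to be attached afterwards.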

Create S3 Buckets

  1. Create a new S3 bucket
  • Open S3 console
  • Create an S3 bucket to use for this class
  2. Create Folders to use in S3 Bucket
  • Create a pyspark folder
  • Create a hive folder
  • Create a datasets folder (the CSV dataset is uploaded here)
  • Create an outputs folder
  • Create a results folder
  • Upload files to folders
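
The folder setup above can also be scripted instead of clicked through in the console. This is a rough sketch under assumptions: the helper names `folder_keys` and `create_folders` are hypothetical, boto3 and AWS credentials are assumed for the real call, and the bucket name is whatever you chose when creating the bucket. S3 has no real folders; a zero-byte object whose key ends in `/` is what the console shows as one.

```python
FOLDERS = ["pyspark", "hive", "datasets", "outputs", "results"]


def folder_keys(folders=FOLDERS):
    """Turn folder names into S3 keys; a trailing '/' marks a 'folder' object."""
    return [f"{name}/" for name in folders]


def create_folders(bucket, s3_client=None):
    """Create one empty placeholder object per folder in the given bucket."""
    if s3_client is None:
        import boto3  # imported lazily so folder_keys() works without boto3

        s3_client = boto3.client("s3")
    keys = folder_keys()
    for key in keys:
        s3_client.put_object(Bucket=bucket, Key=key)
    return keys
```

Files can then be uploaded with the same client, for example `s3_client.upload_file("local.csv", bucket, "datasets/local.csv")`, where the file name is a placeholder.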

EMR Studio

  1. Navigate to the EMR home page from the AWS Console and select EMR Studio from the left-hand side.

  2. Select Get Started

  3. Select Create Studio

  4. Insert Studio name

  5. Under Networking and Security select your default VPC and 3 public subnets.

  6. Select the EMR Studio role emr-notebook-role created initially

  7. Select the S3 bucket created initially.

  8. Select the Studio access URL

Spark App

  1. Select Applications under Serverless from the left-hand side menu

  2. Select create application from the top right

  3. Enter a name for the Application. Leave the type as Spark and click create application

  4. Click into the application via the name

  5. Click submit job

  6. Name the job and select the service role created in the setup steps.

  7. Click Submit Job

  8. The job status will go from pending -> running -> (success or failed).
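
The submit-and-watch steps above can also be done from code. The sketch below uses the boto3 `emr-serverless` client (`start_job_run` and `get_job_run`); the helper names, the default job name, and the script location are hypothetical, and the application ID and role ARN must come from your own account.

```python
import time

# States after which the job run will not change any more.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def spark_job_request(application_id, role_arn, script_uri, name="spark-job", args=()):
    """Build the keyword arguments for start_job_run for a PySpark script on S3."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "name": name,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "entryPointArguments": list(args),
            }
        },
    }


def submit_and_wait(request, client=None, poll_seconds=30):
    """Submit the job and poll until it reaches a terminal state."""
    if client is None:
        import boto3  # imported lazily so the request builder works without boto3

        client = boto3.client("emr-serverless")
    run_id = client.start_job_run(**request)["jobRunId"]
    while True:
        job_run = client.get_job_run(
            applicationId=request["applicationId"], jobRunId=run_id
        )["jobRun"]
        state = job_run["state"]
        print(f"{run_id}: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

Building the request separately from submitting it keeps the part that depends on AWS small, so the request can be inspected or logged before anything is started.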

Hive App

  1. Create Application from applications

  2. Name and select Hive application

  3. Open hive application

  4. Submit the job

  5. Name the Hive job, select the Hive script (change the bucket name in the script), and select the service role.

  6. Copy and paste Hive config (change bucket name in json).

  7. Submit the job and monitor it. The job status will go from pending -> running -> success.

  8. Navigate to Glue databases and click emrdb

  9. Check the table created

  10. Select data using AWS Athena and check the created table.
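
Because both the Hive script path and the Hive config contain the bucket name, it can help to generate the job request from a single bucket parameter. The sketch below is an assumption-laden illustration, not the config used in class: the script key `hive/script.sql`, the scratch and warehouse paths, and the function name are hypothetical, and the `hive-site` properties follow the general pattern of an EMR Serverless Hive job rather than the course's actual JSON.

```python
def hive_job_request(application_id, role_arn, bucket, script_key="hive/script.sql"):
    """Build start_job_run arguments for a Hive script stored in the class bucket."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {"hive": {"query": f"s3://{bucket}/{script_key}"}},
        "configurationOverrides": {
            "applicationConfiguration": [
                {
                    "classification": "hive-site",
                    "properties": {
                        # Assumed locations inside the hive folder of the bucket.
                        "hive.exec.scratchdir": f"s3://{bucket}/hive/scratch",
                        "hive.metastore.warehouse.dir": f"s3://{bucket}/hive/warehouse",
                    },
                }
            ]
        },
    }
```

The returned dictionary can be passed to the boto3 `emr-serverless` client as `client.start_job_run(**request)`.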

Week 3 - Class 2 - Dataframes

Week 3 - Class 3 - Project - Data Modelling and Planning



Dataset exploration results

Columns:

  • 'Invoice/Item Number'
  • 'Date'
  • 'Store Number'
  • 'Store Name'
  • 'Address'
  • 'City'
  • 'Zip Code'
  • 'Store Location'
  • 'County Number'
  • 'County'
  • 'Category'
  • 'Category Name'
  • 'Vendor Number'
  • 'Vendor Name'
  • 'Item Number'
  • 'Item Description'
  • 'Pack'
  • 'Bottle Volume (ml)'
  • 'State Bottle Cost'
  • 'State Bottle Retail'
  • 'Bottles Sold'
  • 'Sale (Dollars)'
  • 'Volume Sold (Liters)'
  • 'Volume Sold (Gallons)'
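
One way to turn the column list above into an explicit Spark schema is sketched below. The column names come from the list; the types are guesses that should be checked against the actual file (identifiers, the date, and the money columns are kept as strings so they can be cleaned later), and `build_schema` is a hypothetical helper name. PySpark is imported only inside the function, so it is needed only when the schema is actually built.

```python
# (column name, simple type tag); the types are assumptions, not verified.
COLUMNS = [
    ("Invoice/Item Number", "string"),
    ("Date", "string"),  # parsed into a date after loading
    ("Store Number", "string"),
    ("Store Name", "string"),
    ("Address", "string"),
    ("City", "string"),
    ("Zip Code", "string"),
    ("Store Location", "string"),
    ("County Number", "string"),
    ("County", "string"),
    ("Category", "string"),
    ("Category Name", "string"),
    ("Vendor Number", "string"),
    ("Vendor Name", "string"),
    ("Item Number", "string"),
    ("Item Description", "string"),
    ("Pack", "int"),
    ("Bottle Volume (ml)", "int"),
    ("State Bottle Cost", "string"),  # may carry a currency symbol
    ("State Bottle Retail", "string"),  # may carry a currency symbol
    ("Bottles Sold", "int"),
    ("Sale (Dollars)", "string"),  # may carry a currency symbol
    ("Volume Sold (Liters)", "double"),
    ("Volume Sold (Gallons)", "double"),
]


def build_schema(columns=COLUMNS):
    """Build a StructType from the (name, type tag) pairs above."""
    from pyspark.sql.types import (
        DoubleType,
        IntegerType,
        StringType,
        StructField,
        StructType,
    )

    type_map = {"string": StringType(), "int": IntegerType(), "double": DoubleType()}
    return StructType(
        [StructField(name, type_map[tag], True) for name, tag in columns]
    )
```

The schema would then be passed to the reader, for example `spark.read.csv(path, header=True, schema=build_schema())`, where `spark` is an existing SparkSession and `path` points at the CSV in the datasets folder.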


  • Facts (sales)

DATA MODEL (SnowFlake Schema)


ETL Plan

  • Define an explicit schema for the large CSV dataset using StructType and StructField.
  • Read the .csv file from S3 into a dataframe with the schema defined above, and keep it in memory with .cache() or .persist().
  • Be careful with date columns and with columns that contain currency symbols.
  • Write 6 queries to create the 6 dimension tables from the persisted dataframe.
  • Write a query to create the fact table (FACT) from the same persisted dataframe.
  • Optionally, add another job that acts as a data quality check to verify the data.
  • After this exercise, delete the Glue catalog tables, the created workgroup, the applications in EMR Studio, the S3 bucket folders, and the created roles.
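
The cleaning and dimension steps of the plan can be sketched as follows. This is an illustration under assumptions, not the project solution: the date format `MM/dd/yyyy`, the decimal precision, the choice of store columns, and all function names are hypothetical. The two plain-Python helpers show the intended cleaning rules; `clean_and_split` applies the same rules with Spark column functions and needs PySpark plus a dataframe loaded from the CSV.

```python
from datetime import datetime
from decimal import Decimal


def parse_currency(text):
    """Strip '$' and thousands separators; empty or missing values become None."""
    if text is None or not text.strip():
        return None
    return Decimal(text.strip().replace("$", "").replace(",", ""))


def parse_date(text, fmt="%m/%d/%Y"):
    """Parse a date string; the month/day/year format is an assumption."""
    if text is None or not text.strip():
        return None
    return datetime.strptime(text.strip(), fmt).date()


def clean_and_split(df):
    """Clean the raw dataframe, persist it, and derive one example dimension."""
    from pyspark.sql import functions as F  # needed only when running on Spark

    cleaned = (
        df.withColumn("Date", F.to_date(F.col("Date"), "MM/dd/yyyy"))
        .withColumn(
            "Sale (Dollars)",
            F.regexp_replace(F.col("Sale (Dollars)"), r"[$,]", "").cast(
                "decimal(12,2)"
            ),
        )
        .persist()  # reused by every dimension query and by the fact query
    )
    # Example dimension: one row per store.
    dim_store = cleaned.select(
        "Store Number", "Store Name", "Address", "City", "Zip Code"
    ).dropDuplicates(["Store Number"])
    return cleaned, dim_store
```

The remaining dimensions and the fact table would follow the same pattern: select the relevant columns from the persisted dataframe and deduplicate on the key column.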


License: MIT License

