josephmachado / spark_submit_airflow

Simple repo to demonstrate how to submit a spark job to EMR from Airflow

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

How to submit Spark jobs to EMR cluster from Airflow

This is the repository for blog at How to submit Spark jobs to EMR cluster from Airflow.

Prerequisites

  1. docker (make sure to have docker-compose as well).
  2. git to clone the starter repo.
  3. AWS account to set up required cloud services.
  4. Install and configure AWS CLI on your machine.

Design

DAG Design

Data

From the project directory do

wget https://www.dropbox.com/sh/amdyc6z8744hrl5/AADS8aPTbA-dRAUjvVfjTo2qa/movie_review
mkdir ./dags/data
mv movie_review ./dags/data/movie_review.csv

Setup and run

If this is your first time using AWS, make sure to check for presence of the EMR_EC2_DefaultRole and EMR_DefaultRole default role as shown below.

aws iam list-roles | grep 'EMR_DefaultRole\|EMR_EC2_DefaultRole'
# "RoleName": "EMR_DefaultRole",
# "RoleName": "EMR_EC2_DefaultRole",

If the roles not present, create them using the following command

aws emr create-default-roles

Also create a bucket, using the following command.

aws s3api create-bucket --acl public-read-write --bucket <your-bucket-name>

Replace <your-bucket-name> with your bucket name. eg.) if your bucket name is my-bucket then the above command becomes aws s3api create-bucket --acl public-read-write --bucket my-bucket

and press q to exit the prompt

After use, you can delete your S3 bucket as shown below

aws s3api delete-bucket --bucket <your-bucket-name>

and press q to exit the prompt

docker-compose -f docker-compose-LocalExecutor.yml up -d

go to http://localhost:8080/admin/ and turn on the spark_submit_airflow DAG. You can check the status at http://localhost:8080/admin/airflow/graph?dag_id=spark_submit_airflow.

DAG

Terminate local instance

docker-compose -f docker-compose-LocalExecutor.yml down
aws s3api delete-bucket --bucket <your-bucket-name>

Contact

website: https://www.startdataengineering.com/

twitter: https://twitter.com/start_data_eng

About

Simple repo to demonstrate how to submit a spark job to EMR from Airflow

License:Apache License 2.0


Languages

Language:Python 50.3%Language:Shell 34.1%Language:Dockerfile 15.6%