SPARK-LAB

This project aims to ingest and clean airbnb public data via AWS Glue and Local PySpark

Glue Configurations

You can access to aws console to create a new glue job or deploy via an IaC tool like Cloudformation or Terraform...

I created a service role which is able to access to s3 bucket and Cloudwatch Log in Read and Write Mode. I have allocated 2 workers for my glue job. I am using Glue 3.0

Using glue dynamic frame

Result: Ingest CSV File of 14 MB in 1 Minutes 5 Seconds

Using spark dataframe

Result: Ingest CSV File of 14 MB in 1 Minutes 16 Seconds

Run Spark in Local Standalone Mode

Requirements: ``

Java 8
Spark-3.1.3-bin-hadoop3.2
Python 3.9 (Anaconda if you are familiar with conda)
pyspark from
py4j
findspark (not required) ``

Run Spark Scala/Python in Local

You can test the installation with my juypter notebook

Get AWS Credentials to interact with aws resources

Official Docs

About

Languages

Language:Jupyter Notebook 96.7%Language:Python 3.3%