Run Apache Spark On Amazon Athena

What's included
Set Up
Main Tutorial
Useful Links
Creators

What's included

This repo supplements the YouTube video on running Apache Spark workloads on Amazon Athena.

You will need an AWS user with permissions to access Athena, S3, and the Glue Data Catalog. I am using my Admin account to carry out the tutorial.

Data

Below is the schema for the customers table, which the CloudFormation template creates in the Glue Data Catalog, along with a few rows of sample data.

Customers

Customerid	Firstname	Lastname	Fullname
293	Catherine	Abel	Catherine Abel
295	Kim	Abercrombie	Kim Abercrombie
297	Humberto	Acevedo	Humberto Acevedo

Set up

Run the CloudFormation template. This will create:

The S3 bucket
Glue database
Customers table in the Glue database

Upload the data from the data folder to the S3 bucket.

Main Tutorial

Show all databases

# Show Databases
spark.sql("show databases").show()
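If the database names are needed as a plain Python list, for example to confirm that the CloudFormation template created `athena_spark_tutorial_db`, the result can be collected instead of printed. The following is a minimal sketch: the helper name `list_databases` is hypothetical, and `spark` is assumed to be the session that the Athena notebook provides.

```python
def list_databases(spark):
    """Return the database names visible to the session as a list of strings.

    `spark` is assumed to be the SparkSession provided by the Athena
    notebook; `list_databases` is a hypothetical helper name.
    """
    rows = spark.sql("show databases").collect()
    # The name of the result column may differ between Spark versions,
    # so read the value by position rather than by column name.
    return [row[0] for row in rows]


# In the notebook:
# "athena_spark_tutorial_db" in list_databases(spark)
```

Collecting is reasonable here because the list of databases is small; for large results, `show()` or `limit()` is usually the safer choice.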

Read data from Customers Table

# Read from the customers table in the Glue Data Catalog
sqlDF = spark.sql("SELECT * FROM athena_spark_tutorial_db.customers")

# Show up to the first 50 rows
sqlDF.show(50)
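The same read can also be written with the DataFrame API instead of a SQL string. This is a sketch rather than the method used in the video: the helper name `read_customers` and its `max_rows` parameter are hypothetical, and `spark` is again assumed to be the notebook's session.

```python
def read_customers(spark, max_rows=50):
    """Load the customers table through the DataFrame API.

    `spark` is assumed to be the SparkSession provided by the Athena
    notebook; `read_customers` and `max_rows` are hypothetical names.
    """
    # spark.table() resolves the qualified name against the catalog,
    # just as the SQL query does.
    customers = spark.table("athena_spark_tutorial_db.customers")
    # limit() caps the rows held in the DataFrame itself,
    # not only the rows that show() prints.
    return customers.limit(max_rows)


# In the notebook:
# read_customers(spark).show(50)
```

Note that `show()` prints 20 rows by default, so the explicit `show(50)` is still needed to display all 50.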

Print the schema

# Check the column names and types in the DataFrame
sqlDF.printSchema()
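`printSchema()` is meant for reading by eye. To check the schema in code, the `dtypes` attribute of a DataFrame gives the same information as a list of `(name, type)` pairs. The sketch below uses it to report which expected columns are missing; the helper name `missing_columns` is hypothetical, and the case-insensitive comparison is an assumption, made because the catalog may store column names in lower case.

```python
def missing_columns(df, expected):
    """Return the names in `expected` that are not columns of `df`.

    `missing_columns` is a hypothetical helper. Names are compared
    case-insensitively (an assumption, see the text above).
    """
    # df.dtypes is a list of (column_name, type_string) tuples.
    present = {name.lower() for name, _type in df.dtypes}
    return [name for name in expected if name.lower() not in present]


# In the notebook:
# missing_columns(sqlDF, ["Customerid", "Firstname", "Lastname", "Fullname"])
```

An empty list means every expected column was found.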

Select Fields From A DataFrame

# Select certain fields from the customers table
sqlDF = spark.sql("SELECT Firstname FROM athena_spark_tutorial_db.customers")

# Show up to the first 50 rows
sqlDF.show(50)
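Fields can also be selected with `DataFrame.select()` rather than by editing the SQL string, which is convenient when the list of fields comes from a variable. A minimal sketch, assuming `spark` is the notebook's session; the helper name `select_fields` and its `table` parameter are hypothetical.

```python
def select_fields(spark, fields, table="athena_spark_tutorial_db.customers"):
    """Return a DataFrame holding only `fields` from `table`.

    `spark` is assumed to be the SparkSession provided by the Athena
    notebook; `select_fields` and `table` are hypothetical names.
    """
    df = spark.table(table)
    # select() accepts column names as separate string arguments.
    return df.select(*fields)


# In the notebook:
# select_fields(spark, ["Firstname", "Lastname"]).show(50)
```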

Creators

Johnny Chivers

https://github.com/johnny-chivers/

Useful Links

Enjoy 🤘
