Data engineering project: visualize visas issued by Japan, by country and over time
This is a small data engineering project for learning how to install an Apache Spark cluster on a server and how to work with that cluster from a local machine via PySpark.
Original tutorial: https://www.youtube.com/watch?v=f-IcM8mFmDc&t=160s
Visualized map:
-
Create a virtual environment in the local project folder: python -m venv japan-visa-de
-
Download the Japan visa dataset (CSV file): https://www.kaggle.com/datasets/yutodennou/visa-issuance-by-nationality-and-region-in-japan
-
Create a VM on EC2 (t2.xlarge), download the SSH key, and move the key into the project folder with the "scp" command.
-
Restrict permissions on your private key so that only you can read it: chmod 400 private_key.pem
-
Install Docker Compose via image.
-
Run Docker Compose to bring up the Spark cluster.
-
Add an inbound rule to the EC2 security group so that the Spark master web UI is reachable on port 9090.
-
Write the PySpark code, upload it to the Spark cluster machine, and run it with spark-submit.
-
Download the results (visualized images/HTML) back to the local machine.
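
For the PySpark step above, a job could look roughly like the sketch below. It is not the code from the tutorial or from this project. The file name visa_job.py, the column names (country, year, number_of_issued), and the output layout are all assumptions for illustration, so check them against the actual CSV header before running. The sketch normalizes the header names, sums issued visas per country and year, and writes the totals as a single CSV.

```python
"""Hypothetical visa_job.py: total visas issued per country and year."""
import re
import sys


def normalize_column_name(name: str) -> str:
    """Lower-case a header and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def run_job(input_csv: str, output_dir: str,
            country_col: str = "country",          # assumed column name
            year_col: str = "year",                # assumed column name
            count_col: str = "number_of_issued"    # assumed column name
            ) -> None:
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("japan-visa-de").getOrCreate()
    try:
        df = spark.read.csv(input_csv, header=True, inferSchema=True)
        # Rename every column to its normalized form.
        df = df.toDF(*[normalize_column_name(c) for c in df.columns])

        totals = (
            df.filter(F.col(country_col).isNotNull())
              .groupBy(country_col, year_col)
              .agg(F.sum(F.col(count_col).cast("long")).alias("visas_issued"))
              .orderBy(year_col, country_col)
        )
        # coalesce(1) yields one output file, which is easier to download.
        totals.coalesce(1).write.mode("overwrite").csv(output_dir, header=True)
    finally:
        spark.stop()


# Runs only when both an input path and an output path are given.
if __name__ == "__main__" and len(sys.argv) == 3:
    run_job(sys.argv[1], sys.argv[2])
```

On the cluster machine this would be started with something like: spark-submit visa_job.py input.csv output_dir (both paths are placeholders). The resulting CSV can then be downloaded and plotted locally.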