erjan / data_engineering_japan_visas_pyspark

Data engineering project: visualize the number of visas issued by Japan, by country and time of issue



This is a small data engineering project for learning how to install an Apache Spark cluster on a server and how to work with that cluster from a local machine via PySpark.

Original tutorial: https://www.youtube.com/watch?v=f-IcM8mFmDc&t=160s

Visualized map: Screenshot_8

2nd map: Screenshot_13

  1. Create a virtual environment in the local project folder: python -m venv japan-visa-de

  2. Download the Japan visa CSV dataset: https://www.kaggle.com/datasets/yutodennou/visa-issuance-by-nationality-and-region-in-japan

  3. Create a VM on EC2 (t2.xlarge), download the SSH key, and move the key to the project folder using the "scp" command.

  4. Restrict the key's permissions: chmod 400 private_key.pem (use your own key file name)

  5. Install Docker Compose via image.

  6. Run Docker Compose to bring up the Spark cluster.

  7. Add an inbound rule to the EC2 security group so that the Spark master web UI is reachable on port 9090.

  8. Write the PySpark code, upload it to the Spark cluster machine, and execute it using spark-submit.

  9. Download the results (visualized images/HTML) back to the local machine.
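
As an illustration of step 8, here is a minimal sketch of what a PySpark job for this dataset could look like. It is not the code from this repository or from the tutorial. The file name visa_job.py, the helper normalize_column, and the column names country, year, and number_of_issued are all assumptions; check the actual CSV header before using it. The sketch only aggregates visa counts per country and year and leaves out the map visualization.

```python
"""visa_job.py: hypothetical PySpark job for step 8 (all names are assumptions)."""
import re
import sys


def normalize_column(name: str) -> str:
    """Turn a raw CSV header into a lowercase snake_case column name."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
    return cleaned.lower()


def run(input_path: str, output_path: str) -> None:
    # Imported here so the helper above works even where Spark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("japan-visa-de").getOrCreate()

    # Read the CSV with a header row and let Spark infer the column types.
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    df = df.toDF(*[normalize_column(c) for c in df.columns])

    # Assumed columns after normalization: country, year, number_of_issued.
    totals = (
        df.groupBy("country", "year")
        .agg(F.sum("number_of_issued").alias("visas_issued"))
        .orderBy("year", "country")
    )

    # Write the aggregated result so it can be downloaded in step 9.
    totals.write.mode("overwrite").csv(output_path, header=True)
    spark.stop()


if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

Once uploaded to the cluster machine, a script like this would typically be started with spark-submit, passing the input CSV path and the output directory as its two arguments. The exact invocation (master URL, container name, mounted paths) depends on how the Docker Compose setup is configured.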


License: Apache License 2.0

