# pyspark_primer

Jupyter notebook of basic PySpark operations.
## Naming Convention

- Folders - lower case, words separated by dash (`-`)
- Files - lower case, words separated by underscore (`_`)
## Setup

### PySpark on Jupyter for Mac/Linux Systems

#### Pre-Reqs

- Linux/macOS machine; Windows 10 users can use the Windows Subsystem for Linux, available on the MS Store
- Jupyter installed on the machine
- `$HOME` pointing to the home location on the machine
- Internet connection to download packages
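A quick sanity check for the pre-reqs above (an illustrative one-liner, not part of the repo's scripts):

```shell
# Verify $HOME resolves and Jupyter is on PATH before proceeding.
echo "HOME is set to: $HOME"
command -v jupyter >/dev/null 2>&1 && echo "jupyter found" || echo "jupyter not found - install it first"
```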
#### Important Notes before Setup

- The location of Jupyter kernels can be found by running `jupyter kernelspec list`
- Change the location where the PySpark kernel is copied by referring to where the other kernels are installed (see above)
- In `pyspark_kernel.json`, change the values of `SPARK_HOME`, `PYTHONPATH`, etc. based on your system
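For reference, a Jupyter kernel spec is a small JSON file with `display_name`, `language`, `argv`, and `env` fields. The sketch below writes a minimal example to a demo directory; all paths are placeholders and the repo's actual `pyspark_kernel.json` may differ:

```shell
# Write a sample kernel spec to a demo directory (placeholder paths throughout;
# adjust SPARK_HOME and PYTHONPATH to your system before using for real).
KERNEL_DIR="/tmp/pyspark_kernel_demo"
mkdir -p "$KERNEL_DIR"
cat > "$KERNEL_DIR/kernel.json" <<'EOF'
{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python3", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/home/<user>/spark_download/<spark-2.4-dir>",
    "PYTHONPATH": "/home/<user>/spark_download/<spark-2.4-dir>/python"
  }
}
EOF
```

To install it for real, the directory would live under the kernels path reported by `jupyter kernelspec list`.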
#### Instructions

- Clone the repository
- Copy `pyspark_kernel.json`, `spark_installation.sh`, and `spark_variable_setter.sh` to the home directory of your machine; the home location is stored in the `$HOME` variable on most systems
- Source the `spark_installation.sh` file via `source spark_installation.sh`; it performs the following steps:
  - Updates the system repository
  - Checks for Java and installs it if not already present on the system (default JDK version)
  - Creates the directory `spark_download` in `$HOME`
  - Downloads and unzips Spark 2.4 and a compatible Hadoop into `$HOME/spark_download`
  - Stores environment variables like `SPARK_HOME`, `PYTHONPATH`, and `JAVA_HOME` in the local shell and in `.bashrc`
  - Currently copies `pyspark_kernel.json` to `$HOME/.local/share/jupyter/kernels/`
- Exit and re-open the shell
- Typing `jupyter notebook` should start a new session
- PySpark should be visible in the list of available kernels
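The variable-exporting step of the instructions above can be sketched as follows. This is a minimal illustration, not the actual `spark_installation.sh`: the Spark build name is a placeholder, and a demo rc file stands in for `.bashrc` so your real one is untouched:

```shell
# Placeholder Spark 2.4 build name; the real script downloads and unzips
# the actual package into $HOME/spark_download.
SPARK_PKG="spark-2.4.0-bin-hadoop2.7"
export SPARK_HOME="$HOME/spark_download/$SPARK_PKG"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"

# Persist the variables for future shells, as the script does with .bashrc;
# here a demo file is used instead of the real rc file.
DEMO_RC="/tmp/demo_bashrc"
{
  echo "export SPARK_HOME=\"$SPARK_HOME\""
  echo "export PYTHONPATH=\"$PYTHONPATH\""
} >> "$DEMO_RC"
grep SPARK_HOME "$DEMO_RC"
```

Because the variables are appended to the rc file, exiting and re-opening the shell (as the instructions say) is what makes them visible to new sessions.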