# pyspark_primer

Jupyter notebook of basic PySpark operations.
## Naming Convention

- Folders - lower case, words separated by dash (`-`)
- Files - lower case, words separated by underscore (`_`)
## Setup

### PySpark on Jupyter for Mac/Linux Systems

#### Pre-Reqs

- Linux/macOS machine; Windows 10 users can use the Windows Subsystem for Linux, available on the MS Store
- Jupyter installed on the machine
- `$HOME` pointing to the home location on the machine
- Internet connection to download packages
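A quick sanity check for the pre-reqs above (an illustrative one-liner, not part of the repo's scripts):

```shell
# Verify $HOME resolves and Jupyter is on PATH before proceeding.
echo "HOME is set to: $HOME"
command -v jupyter >/dev/null 2>&1 && echo "jupyter found" || echo "jupyter not found - install it first"
```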
#### Important Notes before Setup

- The location of Jupyter kernels can be found by running `jupyter kernelspec list`
- Change the location where the PySpark kernel is copied by referring to where the other kernels are installed (see above)
- In `pyspark_kernel.json`, change the values of `SPARK_HOME`, `PYTHONPATH`, etc. based on your system
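For reference, a Jupyter kernel spec is a small JSON file with `display_name`, `language`, `argv`, and `env` fields. The sketch below writes a minimal example to a demo directory; all paths are placeholders and the repo's actual `pyspark_kernel.json` may differ:

```shell
# Write a sample kernel spec to a demo directory (placeholder paths throughout;
# adjust SPARK_HOME and PYTHONPATH to your system before using for real).
KERNEL_DIR="/tmp/pyspark_kernel_demo"
mkdir -p "$KERNEL_DIR"
cat > "$KERNEL_DIR/kernel.json" <<'EOF'
{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python3", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/home/<user>/spark_download/<spark-2.4-dir>",
    "PYTHONPATH": "/home/<user>/spark_download/<spark-2.4-dir>/python"
  }
}
EOF
```

To install it for real, the directory would live under the kernels path reported by `jupyter kernelspec list`.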
#### Instructions

- Clone the repository
- Copy `pyspark_kernel.json`, `spark_installation.sh`, and `spark_variable_setter.sh` to the home directory of your machine; the home location is stored in the `$HOME` variable on most systems
- Source the `spark_installation.sh` file via `source spark_installation.sh`; it performs the following steps:
  - Updates the system repository
  - Checks for Java and installs it if not already present on the system (default JDK version)
  - Creates the directory `spark_download` in `$HOME`
  - Downloads and unzips Spark 2.4 and a compatible Hadoop into `$HOME/spark_download`
  - Stores environment variables like `SPARK_HOME`, `PYTHONPATH`, and `JAVA_HOME` in the local shell and in `.bashrc`
  - Currently copies `pyspark_kernel.json` to `$HOME/.local/share/jupyter/kernels/`
- Exit and re-open the shell
- Typing `jupyter notebook` should start a new session
- PySpark should be visible in the list of available kernels
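The variable-exporting step of the instructions above can be sketched as follows. This is a minimal illustration, not the actual `spark_installation.sh`: the Spark build name is a placeholder, and a demo rc file stands in for `.bashrc` so your real one is untouched:

```shell
# Placeholder Spark 2.4 build name; the real script downloads and unzips
# the actual package into $HOME/spark_download.
SPARK_PKG="spark-2.4.0-bin-hadoop2.7"
export SPARK_HOME="$HOME/spark_download/$SPARK_PKG"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"

# Persist the variables for future shells, as the script does with .bashrc;
# here a demo file is used instead of the real rc file.
DEMO_RC="/tmp/demo_bashrc"
{
  echo "export SPARK_HOME=\"$SPARK_HOME\""
  echo "export PYTHONPATH=\"$PYTHONPATH\""
} >> "$DEMO_RC"
grep SPARK_HOME "$DEMO_RC"
```

Because the variables are appended to the rc file, exiting and re-opening the shell (as the instructions say) is what makes them visible to new sessions.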