maprihoda / data-analysis-with-python-and-pyspark

Learning PySpark locally (i.e. without using any cloud service) by following the excellent Data Analysis with Python and PySpark by Jonathan Rioux.

Environment setup

From the project root, run:

pipenv install

This will create a virtual environment with all the required dependencies installed.
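Once the install finishes, you can work inside the environment in the usual pipenv way; as a quick sketch:

pipenv shell                 # spawn a shell inside the virtual environment
pipenv run python -V         # or run a single command in it without activating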

Although only pipenv is required for this setup to run, I strongly recommend having both pyenv and pipenv installed: pyenv manages Python versions, while pipenv takes care of virtual environments.

If you're on Windows, try pyenv-win. pipenv should work just fine.
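A rough sketch of that workflow (the Python version below is only an example, not a requirement of the book):

pyenv install 3.10.13            # install a Python version with pyenv
pyenv local 3.10.13              # pin it for this project
pipenv install --python 3.10     # have pipenv build the virtual environment with it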

The notebooks were created with Visual Studio Code's Jupyter code cells, which I prefer over standard Jupyter notebooks/JupyterLab because of the much better Git integration.

You can easily convert the code-cell files into Jupyter notebooks with Visual Studio Code: open a file, right-click, and select Export current Python file as Jupyter notebook.
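For reference, VS Code treats a plain Python file as a sequence of cells separated by "# %%" markers; the snippet below is only an illustrative sketch of that layout, not taken from the notebooks:

# %%
from pyspark.sql import SparkSession

# start a local Spark session; each "# %%" block runs as its own cell
spark = SparkSession.builder.appName("local-example").getOrCreate()

# %%
spark.range(5).show()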

The data directory contains only the smaller-sized data files. You will have to download the larger ones as per the instructions in the individual notebooks, e.g.:

import os

home_dir = os.environ["HOME"]  # HOME is set on Linux/macOS
DATA_DIRECTORY = os.path.join(home_dir, "Documents", "spark", "data", "backblaze")

This works on my Linux machine. You may need to modify the path if you're on Windows.
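If you want a variant that works on both platforms, something along these lines should do (the directory layout simply mirrors the example above):

import os

home_dir = os.path.expanduser("~")  # resolves the home directory on Linux, macOS and Windows
DATA_DIRECTORY = os.path.join(home_dir, "Documents", "spark", "data", "backblaze")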