riyadparvez / pyspark-datascience

PySpark notebooks

pyspark-notebooks

PySpark Jupyter notebooks

Installation

We provide a pre-built Docker image for easy experimentation. The image is based on the official Jupyter pyspark-notebook image, with some additional packages installed.

To pull the image: docker pull riyadparvez/pyspark-notebooks

To run a container: docker run --rm -p 8888:8888 -p 8080:8080 -p 4040:4040 -v /path/to/pyspark-notebooks:/home/jovyan/work --name pyspark-notebook riyadparvez/pyspark-notebooks start-notebook.sh --NotebookApp.token=''
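For scripted use, the same invocation can be assembled from Python. This is a minimal sketch and not part of the repository: the helper name `build_run_command` and its parameters are hypothetical, and the port comments assume the usual defaults for Jupyter and Spark.

```python
from pathlib import Path

IMAGE = "riyadparvez/pyspark-notebooks"


def build_run_command(notebook_dir, name="pyspark-notebook", ports=(8888, 8080, 4040)):
    """Return the `docker run` argument list shown above (hypothetical helper)."""
    # Docker bind mounts need an absolute host path.
    host_dir = Path(notebook_dir).expanduser().resolve()
    cmd = ["docker", "run", "--rm"]
    # Assumed defaults: 8888 Jupyter, 8080 Spark standalone master UI, 4040 Spark application UI.
    for port in ports:
        cmd += ["-p", f"{port}:{port}"]
    cmd += [
        "-v", f"{host_dir}:/home/jovyan/work",
        "--name", name,
        IMAGE,
        "start-notebook.sh",
        # The shell form --NotebookApp.token='' becomes an empty value here.
        "--NotebookApp.token=",
    ]
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` would start the container. The empty token turns off Jupyter's token authentication, so this form is best kept to a trusted local machine.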

Please see the documentation of the official Jupyter Docker image for more usage options.

Notebooks

Most of the notebooks are WIP. Complete notebooks are:

Datasets

Most of the notebooks use data from Kaggle competitions or from the University of California, Irvine (UCI) Machine Learning Repository. For UCI datasets, the notebooks download the data automatically. Kaggle datasets must be downloaded manually, since there is no good automated way to download them.
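The automatic download for UCI data could look roughly like the following. This is a sketch under assumptions, not the code the notebooks actually use: the helper name `fetch_if_missing` is hypothetical, and it relies only on the Python standard library.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the local path."""
    dest = Path(dest)
    if dest.exists():
        # Reuse the cached copy so re-running a notebook does not download again.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a partial file at the final path.
    tmp = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, str(tmp))
    tmp.replace(dest)
    return dest
```

A notebook cell would call `fetch_if_missing(uci_url, "data/some-dataset.csv")`, where `uci_url` stands for the file URL of the chosen UCI dataset, and then read the returned path with Spark.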

About

License: MIT License


Languages

Jupyter Notebook 98.7%, Python 1.3%