m-doru / dslab-hw3

Week 9: Homework 3

This is the final week of the Apache Spark series and consists of a graded project notebook found in this repository.

Due date: May 8, 18:00

As always, fork this repository into your own namespace. The contents of your forked repo on the due date will be graded.

Getting started

The setup is the same as for week 7. We will be using notebooks on the IC Cluster.

If you already completed this setup for week 7, you can safely skip to the next section!

The notebooks in this series are most easily run on the IC Cluster. To set up your Linux PATH and PYTHONPATH correctly, log in to your iccluster account and copy/paste the commands below:

$ echo 'export PATH=/opt/anaconda3/bin:$PATH' >> ~/.bash_profile
$ echo "export PYTHONPATH=/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip:/usr/hdp/current/spark2-client/python" >> ~/.bash_profile
$ source ~/.bash_profile
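
To confirm that the new settings took effect, a quick check along the following lines may help. This is only a sketch: `path_contains` is a hypothetical helper, not something provided by the cluster, and it assumes the directories are exactly those used in the setup commands.

```shell
# Hypothetical helper (not part of the course setup): succeeds if the
# colon-separated list in $1 contains the entry $2 exactly.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the two variables configured by the setup commands.
if path_contains "$PATH" /opt/anaconda3/bin; then
    echo "PATH: anaconda found"
else
    echo "PATH: anaconda missing, re-run the setup commands"
fi

if path_contains "${PYTHONPATH:-}" /usr/hdp/current/spark2-client/python; then
    echo "PYTHONPATH: Spark client found"
else
    echo "PYTHONPATH: Spark client missing, re-run the setup commands"
fi
```

Wrapping the list in extra colons before matching avoids false positives on entries that merely share a prefix, such as /opt/anaconda3/bin-old.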

Starting notebooks

To begin this week's exercises, fork this repository into your own namespace on GitLab and then clone the fork in your account on the iccluster:

$ git clone https://git-dslab.epfl.ch/<gitlab-username>/homework3-spark.git

Please make a copy of the EMPTY twitter-hashtags.ipynb notebook and add your name to the filename, for example twitter-hashtags-calvin.ipynb. Add this notebook to the repo and commit. This way, if we need to push an update to the EMPTY notebook, you won't get merge conflicts.
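
The copy-and-commit step could be scripted roughly as follows. This is a sketch rather than a required procedure: `make_personal_copy` is a hypothetical helper, "calvin" is a placeholder name, and the commit message is only an example.

```shell
# Hypothetical helper: copy the empty template to twitter-hashtags-<name>.ipynb
# in the current directory and print the new filename. Refuses to overwrite an
# existing copy so that finished work is never clobbered.
make_personal_copy() {
    nb_name="${1:-}"
    nb_template="twitter-hashtags.ipynb"
    nb_copy="twitter-hashtags-$nb_name.ipynb"
    if [ -z "$nb_name" ]; then
        echo "usage: make_personal_copy <name>" >&2
        return 1
    fi
    if [ -e "$nb_copy" ]; then
        echo "$nb_copy already exists" >&2
        return 1
    fi
    cp "$nb_template" "$nb_copy" && echo "$nb_copy"
}

# Usage, from inside the clone ("calvin" is a placeholder name):
#   make_personal_copy calvin
#   git add twitter-hashtags-calvin.ipynb
#   git commit -m "Add personal copy of the notebook"
```

Leaving the EMPTY notebook untouched and working only in the named copy is what keeps later updates to the template free of merge conflicts.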

In the notebooks, you will see TODO sections each worth a number of points.

To start the notebook server, run:

$ jupyter notebook --ip

Browse to the clone of the repository and open the notebook that you made above with your name in the filename.

If you close your browser window by accident and need to get back to your notebook, you can find the currently running notebook servers with:

$ jupyter notebook list
Currently running servers: :: /home/roskar


