Lecture notes for the "Data Science, the Pythonic way" @ Develer
@leriomaggio | valeriomaggio | valeriomaggio_at_gmail_dot_com |
git clone https://github.com/leriomaggio/develer-data-science.git
(from apprentice to doctor strange)
- Level I) Apprentice: Pythonic tools for Data Science
    - Dev Tools for Data Scientists and Jupyter notebooks
    - Numerical computation in Python: numpy
    - Working with data: pandas
- Level II) Alchemist: Data Visualisation
    - Basic principles of data visualisation
    - Introduction to matplotlib
    - Interactive data visualisation using bokeh
- Level III) Mage: Crash course on Machine Learning
    - What is Machine Learning
    - Introduction to sklearn
    - Supervised and Unsupervised Machine Learning
    - Robust Machine Learning: selection bias and cross-validation
- Level IV) Arch-Mage: Deep Learning & Pythonic perspectives
    - What is Deep Learning
    - Deep Learning frameworks
    - Introduction to Keras
The course will be organised in four different parts, mostly covering the basics (plus some more advanced topics) related to Machine Learning and Data Science.
We will start by introducing the basics of data science in Python, along with the (development) tools and frameworks to be used. Then we will start working with real data (in different formats) to get a general feeling for what it means to be a data scientist. There will also be a section specifically focused on the basic principles (and tools) of data visualisation. Finally, more advanced concepts will be introduced: a general introduction to Machine Learning models and settings (i.e. supervised and unsupervised), along with a glimpse of Deep Learning models and frameworks.
All these parts will be presented from the perspective of the developer and practitioner who wants to learn (and understand) Data Science in a very practical way. To this end, the materials contain lots of exercises and challenges along the way to test your skills.
This tutorial requires the following packages:
- Python version 3.6
    - Python 3.4+ should be fine as well
    - likely Python 2.7 would be also fine, but who knows? :P
- numpy: http://www.numpy.org/
- scipy: http://www.scipy.org/
- matplotlib: http://matplotlib.org/
- pandas: http://pandas.pydata.org
- scikit-learn: http://scikit-learn.org
- jupyter & notebook: http://jupyter.org
Plus, for the last Deep Learning section:
- keras: http://keras.io
- tensorflow: https://www.tensorflow.org
- (optional) torch: http://pytorch.org
The easiest way to get (most of) these is to use an all-in-one installer such as Anaconda from Continuum, which is available for Linux, Windows, and macOS.
I'm currently running this tutorial with Python 3 on Anaconda:
$ python --version
Python 3.6.6
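The same check can also be made from inside Python. This is only a minimal sketch: it assumes Python 3.4 as the lowest acceptable version (taken from the requirements list above), and the helper name `python_is_supported` is hypothetical, not part of the course materials.

```python
import sys

# Lowest version the notes say "should be fine" (an assumption; adjust if needed).
MIN_VERSION = (3, 4)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if (major, minor) of `version_info` is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the running interpreter and whether it meets the minimum.
print("Python", sys.version.split()[0])
print("supported:", python_is_supported())
```

Passing an explicit tuple, e.g. `python_is_supported((2, 7, 15))`, lets you test a version other than the one currently running.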
If you want to access the materials, you have several options:
Most of the materials in this course are provided as a collection of Jupyter Notebooks.
In case you don't know what a Jupyter notebook is, here is a good reference for a quick introduction: Jupyter Notebook Beginner Guide.
On the other hand, if you also want to know (and you should) what a Jupyter notebook is NOT (spoiler alert: it is NOT an IDE), here is a very nice reference:
→ I Don't like Notebooks, by Joel Grus @ JupyterCon 2018.
If you already have the environment set up on your machine, all you need to do is run the Jupyter notebook server:
$ jupyter notebook
Alternatively, I suggest you try the new Jupyter Lab environment:
$ jupyter lab
NOTE: Before running the Jupyter server, you must activate the (Python) virtual environment.
Please refer to the section Setting the Environment for detailed instructions on how to install all the required packages and libraries.
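If you are unsure whether an environment is currently active, a quick check from Python can help. The sketch below is an assumption-laden heuristic, not an official API: it relies on conda setting the `CONDA_DEFAULT_ENV` variable on activation and on virtualenv/venv making `sys.prefix` differ from the base prefix. The function name `active_environment` is hypothetical.

```python
import os
import sys


def active_environment():
    """Return a short description of the active environment, or None."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    # (this includes the "base" environment).
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return "conda: " + conda_env
    # venv and recent virtualenv make sys.prefix differ from sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    real_prefix = getattr(sys, "real_prefix", None)
    if real_prefix is not None or sys.prefix != base_prefix:
        return "virtualenv: " + sys.prefix
    return None


print(active_environment() or "no virtual environment detected")
```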
(Consider this option only if your WiFi is stable)
If you don't want the hassle of setting up the environment and libraries on your machine, or you simply want to avoid doing "too much computation" on your own hardware, I strongly suggest using the Binder service.
The primary goal of Binder is to turn a GitHub repo into a collection of interactive Jupyter notebooks.
To start using Binder, just click on the button below:
Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the Google cloud. Moreover, GPU and TPU runtime environments are available free of charge. (This last option will be worth mentioning in the very last part of the course, when we talk about Deep Learning networks.)
Here is an overview of the main features offered by Colaboratory.
To start using Colaboratory, just click on the button below:
This repository provides the files needed to install the required packages. The first step in setting up the environment is to create a Python virtual environment.
Instructions are given below for both the Anaconda Python distribution and the standard Python distribution (from python.org).
This repository includes a conda-environment.yml file, which is needed to re-create the Conda virtual environment.
To re-create the virtual environment:
$ conda env create -f conda-environment.yml
Then, to activate the virtual environment:
$ conda activate develer-science
Alternatively, if you don't want to install (yet) another Python distribution on your machine, or you prefer not to use the full-stack Anaconda Python, I strongly suggest giving the pyenv project a try.
pyenv is a package that lets you easily switch between multiple versions of Python. It is simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.
To set up pyenv, please follow the instructions in the project's GitHub repository for your specific platform and operating system.
There is a pyenv plugin named pyenv-virtualenv, which provides various features to help pyenv users manage virtual environments created by virtualenv or Anaconda. I recommend installing pyenv-virtualenv as described in the official documentation.
Once pyenv and pyenv-virtualenv have been correctly installed and configured, these are the instructions to set up the virtual environment for this tutorial:
$ pyenv install 3.6.6 # download and install Python 3.6.6
$ pyenv virtualenv 3.6.6 develer-science # create virtual env using Py3.6
$ pyenv activate develer-science # activate the environment
$ pip install -r requirements.txt # install requirements
All the notebooks in this tutorial have been saved using a Jupyter Kernel defined on the created virtual environment, named "Python 3.6 (DL Keras TF)".
If you get a warning about a non-existent kernel when you open the notebooks on your machine, you need to create the corresponding IPython kernel:
$ python -m ipykernel install --user --name develer-science --display-name "Python 3.6 (Develer Science)"
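To confirm that the kernel was registered, you can list the kernel specs Jupyter knows about. This is a minimal sketch using `KernelSpecManager.find_kernel_specs()` from the `jupyter_client` package (installed together with Jupyter). The helper name `installed_kernels` is hypothetical, and the kernel name `develer-science` is the one passed to `--name` in the command above.

```python
def installed_kernels():
    """Return {kernel name: resource directory}, or None if jupyter_client is missing."""
    try:
        from jupyter_client.kernelspec import KernelSpecManager
    except ImportError:
        return None
    return KernelSpecManager().find_kernel_specs()


kernels = installed_kernels()
if kernels is None:
    print("jupyter_client is not installed in this environment")
else:
    # One line per registered kernel, then an explicit check for ours.
    for name, path in sorted(kernels.items()):
        print(name, "->", path)
    print("develer-science registered:", "develer-science" in kernels)
```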
>>> import numpy as np
>>> import scipy as sp
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> import sklearn
>>> import keras
Using TensorFlow backend.
>>> import numpy
>>> print('numpy:', numpy.__version__)
>>> import scipy
>>> print('scipy:', scipy.__version__)
>>> import matplotlib
>>> print('matplotlib:', matplotlib.__version__)
>>> import sklearn
>>> print('scikit-learn:', sklearn.__version__)
numpy: 1.15.2
scipy: 1.1.0
matplotlib: 3.0.0
scikit-learn: 0.20.0
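The check above stops at the first missing package. As an alternative, here is a small sketch that reports every package in one pass, printing a placeholder instead of raising when one cannot be imported. The helper name `package_versions` and the `PACKAGES` list are illustrative assumptions, not part of the course materials.

```python
import importlib

# Import names (not PyPI names): scikit-learn is imported as `sklearn`.
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas", "sklearn"]


def package_versions(names=PACKAGES):
    """Map each import name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to a marker string.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in package_versions().items():
    print("{}: {}".format(name, version or "NOT INSTALLED"))
```

Your versions do not need to match the ones listed above exactly, but a "NOT INSTALLED" line means the environment setup did not complete.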