leriomaggio / develer-data-science

Deep dive into Data Science with Python @ Develer

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Develer turns to Data Science

Lecture notes for the "Data Science, the Pythonic way" @ Develer

Author: Valerio Maggio

PostDoc Data Scientist @ FBK/MPBA

Contacts:

@leriomaggio valeriomaggio valeriomaggio_at_gmail_dot_com

Materials:

github

git clone https://github.com/leriomaggio/develer-data-science.git

Outline at a glance:

(from apprentice to doctor strange)

  • Level I) Apprentice: Pythonic tools for Data Science

    • Dev Tools for Data Scientist and Jupyter notebooks
    • Numerical computation in Python: numpy
    • Working with data: pandas
  • Level II) Alchemist: Data Visualisation

    • Basic principles of data visualisation
    • Introduction to matplotlib
    • interactive data visualisation using bokeh
  • Level III) Mage: Crash course on Machine Learning

    • What is Machine Learning
    • Introduction to sklearn
    • Supervised and Unsupervised Machine learning
    • Robust Machine Learning: selection bias and cross-validation
  • Level IV) Arch-Mage : Deep Learning & Pythonic perspectives

    • What is Deep Learning
    • Deep Learning frameworks
    • Introduction to Keras

Description

The course will be organised in four different parts, mostly covering the basics (plus some more advanced topics) related to Machine Learning and Data Science.

We will start by introducing the basics of data science in Python, and the (development) tools and frameworks to be used. Then we will start working with real data (in different formats) to have a very general feeling of what does it mean to be a data scientist. There will also be a section specifically focused on basic principles (and tools) of data visualisation. Finally, more advanced concepts will be introduced. In particular, a general introduction to Machine Learning models and settings (i.e. supervised and unsupervised) will be provided, along with a glimpse of Deep learning models and frameworks.

All these parts will be presented always considering the perspective of the developer and practitioner who wants to learn (and understand) Data Science in a very practical way. For this aim, the materials will contain lots of exercises and challenges along the way to test your skills.


Technical Requirements

This tutorial requires the following packages:

Plus - for the last Deep learning section:

The easiest way to get (most of) these is to use an all-in-one installer such as Anaconda from Continuum, which is available for multiple computer platforms, namely Linux, Windows, and OSX.


Python Version

I'm currently running this tutorial with Python 3 on Anaconda

$ python --version
Python 3.6.6

Accessing the materials

If you want to access the materials, you have several options:

Jupyter Notebook

Most of the materials in this course is provided as a collection of Jupyter Notebooks.

In case you don't know what is a Jupyter notebook, here is a good reference for a quick introduction: Jupyter Notebook Beginner Guide.

On the other hand, if you also want to know (and you should) what is NOT a Jupyter notebook - spoiler alert: it is NOT an IDE - here is a very nice reference:

I Don't like Notebooks, by Joel Grus @ JupyterCon 2018.

If you already have all the environment setup on your machine, all you need to do is to run the Jupyter notebook server:

$ jupyter notebook

Alternatively, I suggest you to try the new Jupyter Lab environment:

$ jupyter lab

NOTE: Before running Jupyter server, it is mandatory to enable the (Python) virtual environment.

Please refer to the section Setting the Environment for detailed instructions on how to install all the required packages and libraries.

Binder

(Consider this option only if your WiFi is stable)

If you don't want the hassle of setting up all the environment and libraries on your machine, or simply you want to avoid doing "too much computation" on your hardware setup, I strongly suggest you to use the Binder service.

The primary goal of Binder is to turn a GitHub repo into a collection of interactive Jupyter notebooks

To start using Binder, just click on the button below: Binder

Google Colaboratory

Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the Google cloud. Moreover, GPU and TPU runtime environments are available, and completely for free. (This last option will be worthwhile mentioning in the very last part of the course, when we will talk about Deep Learning networks).

Here is an overview of the main features offered by Colaboratory.

To start using Colaboratory, just click on the button below: Colab


Setting the Environment

In this repository, files to install the required packages are provided. The first step to setup the environment is to create a Python Virtual Environment.

Whether you are using Anaconda Python Distribution or the Standard Python framework (from python.org), below are reported the instructions for the two cases, respectively.

(a) Conda Environment

This repository includes a conda-environment.yml file that is necessary to re-create the Conda virtual environment.

To re-create the virtual environments:

$ conda env create -f conda-environment.yml

Then, to activate the virtual environment:

$ conda activate develer-science

(b) pyenv & virtualenv

Alternatively, if you don't want to install (yet) another Python distribution on your machine, or you prefer not to use the full-stack Anaconda Python, I strongly suggest to give a try to the new pyenv project.

1. Setup pyenv

pyenv is a new package that lets you easily switch between multiple versions of Python. It is simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.

To setup pyenv, please follow the instructions reported on the GitHub Repository of the project, according to the specific platform and operating system.

There exists a pyenv plugin named pyenv-virtualenv which comes with various features to help pyenv users to manage virtual environments created by virtualenv or Anaconda.

2. Installing pyenv-virtualenv

I would recommend to install pyenv-virtualenv as reported in the official documentation.

3. Setting up the virtual environment

Once pyenv and pyenv-virtualenv have been correctly installed and configured, these are the instructions to set up the virtual environment for this tutorial:

$ pyenv install 3.6.6  # downloads and enables Python 3.6
$ pyenv virtualenv 3.6.6 develer-science  # create virtual env using Py3.6
$ pyenv activate develer-science  # activate the environment
$ pip install -r requirements.txt  # install requirements

Installing Jupyter Kernel (Optional)

All the notebooks in this tutorial have been saved using a Jupyter Kernel defined on the created virtual environment, named "Python 3.6 (DL Keras TF)".

In case you got a warning of non-existent kernel when you open the notebooks on your machine, you need to create the corresponding IPython kernel:

$ python -m ipykernel install --user --name develer-science --display-name "Python 3.6 (Develer Science)"

Test if everything is up&running

1. Check import

>>> import numpy as np
>>> import scipy as sp
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> import sklearn
>>> import keras
Using TensorFlow backend.

2. Check installed Versions

>>> import numpy
>>> print('numpy:', numpy.__version__)
>>> import scipy
>>> print('scipy:', scipy.__version__)
>>> import matplotlib
>>> print('matplotlib:', matplotlib.__version__)
>>> import sklearn
>>> print('scikit-learn:', sklearn.__version__)
    numpy: 1.15.2
    scipy: 1.1.0
    matplotlib: 3.0.0
    scikit-learn: 0.20.0

If everything worked till down here, you're ready to start!

About

Deep dive into Data Science with Python @ Develer

License:MIT License


Languages

Language:Jupyter Notebook 99.8%Language:Python 0.2%