Data Preprocessing Workshop

Installation Instructions

Option 1: Docker

1. Install Docker.
2. Clone this repo.
3. cd into this repo.
4. Run docker-compose build.
5. To access the notebooks, run docker-compose up and go to the URL in the terminal output.

The Docker Compose configuration launches JupyterLab by default. To use the classic Jupyter Notebook interface instead, remove the following two lines from docker-compose.yml:

environment:
  - JUPYTER_ENABLE_LAB=1

Then repeat steps 4 and 5: run docker-compose build again, followed by docker-compose up.

Option 2: Virtualenv

1. Install virtualenv using pip3 (this workshop is written for Python 3).
2. Clone this repo and cd into it.
3. Create a virtual environment with python3 -m venv venvname. Replace venvname with whatever name you'd like for the environment.
4. Run source venvname/bin/activate to activate the environment.
5. Run pip install -r requirements.txt.
6. Run jupyter notebook to launch Jupyter Notebook, or jupyter lab to launch JupyterLab.

Workshop Outline

Each topic listed below includes hands-on exercises.

Day 1

Section 1: First steps of preprocessing

Setup and introduction to preprocessing
Dealing with missing data
Exploring data types
Class distribution and imbalance
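
As a rough illustration of two of these topics (missing data and class imbalance), the sketch below drops incomplete rows and counts class labels using only the standard library. The rows data and its field names are made up for this example; the workshop notebooks may use different data and tools.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
rows = [
    {"length": 4.2, "label": "a"},
    {"length": None, "label": "b"},
    {"length": 3.1, "label": "a"},
    {"length": 5.0, "label": "a"},
]

# Keep only rows where every field is present.
complete = [r for r in rows if all(v is not None for v in r.values())]

# Count how often each class appears to spot imbalance.
class_counts = Counter(r["label"] for r in complete)
print(len(complete), dict(class_counts))
```

Note that dropping the incomplete row also removes the only example of class "b", which is one reason to check the class distribution after handling missing data.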

Section 2: Standardizing data for machine learning

What is standardization, and when should you standardize?
Log normalization
Scaling for feature comparison
Standardization and modeling
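
To make the terms above concrete, here is a minimal sketch of log normalization followed by standardization, assuming a single positive-valued feature. The values list is invented for illustration, and the example uses the population standard deviation; the notebooks may use library helpers instead.

```python
import math
import statistics

# Hypothetical feature with a wide, skewed range (all values positive).
values = [1.0, 10.0, 100.0, 1000.0]

# Log normalization: compress large values with the natural log.
logged = [math.log(v) for v in values]

# Standardization: subtract the mean and divide by the
# (population) standard deviation.
mean = statistics.fmean(logged)
std = statistics.pstdev(logged)
scaled = [(x - mean) / std for x in logged]

print(scaled)
```

After this transformation the feature has mean 0 and standard deviation 1, which puts it on a scale comparable with other standardized features.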

Day 2

Section 1: Extracting information from features

What is feature engineering?
Extracting features using regular expressions
Encoding variables
Aggregate statistics
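
The following sketch shows, under simple assumptions, what extracting a numeric feature with a regular expression and one-hot encoding a categorical variable can look like. The descriptions and colors columns and the extract_miles helper are hypothetical and are not taken from the workshop notebooks.

```python
import re

# Hypothetical free-text column.
descriptions = ["hike 5 miles", "run 3.5 miles", "walk"]

def extract_miles(text):
    """Return the first number in the text as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

miles = [extract_miles(d) for d in descriptions]

# One-hot encode a hypothetical categorical column:
# one 0/1 indicator per distinct category, in sorted order.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(miles)
print(categories, one_hot)
```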

Section 2: Feature selection

What is feature selection, and when should you manually remove features?
Removing correlated features
Using dimensionality reduction for feature selection
Using PCA-transformed features to train a model
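
As one possible illustration of removing correlated features, the sketch below computes Pearson correlations by hand and keeps a feature only if it is not strongly correlated with any feature already kept. The features data, the 0.95 threshold, and the pearson helper are assumptions made for this example, and the code assumes no column is constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical feature columns; height_in is nearly a rescaled height_cm.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30.0, 22.0, 41.0, 35.0],
}

threshold = 0.95  # assumed cutoff for "too correlated"
kept = []
for name, column in features.items():
    # Keep the feature only if it is weakly correlated with all kept ones.
    if all(abs(pearson(column, features[k])) < threshold for k in kept):
        kept.append(name)

print(kept)
```

Which feature of a correlated pair survives depends on iteration order here; in practice that choice is usually made with domain knowledge.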

Section 3: UFO dataset (if we have time!)

Apply various preprocessing techniques to a dataset of UFO sightings and discuss as a group.

sarguido / data-preprocessing-workshop