AashitaK/ML-Workshops

Introduction:

The workshop series focuses on the practical aspects of machine learning. We will work in Python and use real-world datasets from Kaggle, the machine learning platform best suited to the “learn-by-doing” philosophy. The series is aimed at complete beginners to machine learning who are familiar with Python, but it is designed adaptively so that you will be challenged even if you already have some familiarity with machine learning tools.

The four-session workshop is very hands-on and focuses on how to work with datasets. Rather than covering every tool and concept comprehensively, it teaches a minimal set of the most useful ones quickly and shows you how to find resources to explore further.

Timeline:
Session 1: 5:30-7:30 pm on Thursday March 28, 2019 at Aviation Room, HMC
Session 2: 5:30-7:30 pm on Thursday April 4, 2019 at Shan 2454, HMC
Session 3: 5:30-7:30 pm on Thursday April 11, 2019 at Aviation Room, HMC
Session 4: 5:30-7:30 pm on Thursday April 18, 2019 at Shan 2454, HMC

This series is a precursor to a future Deep Learning workshop series.

General structure of each two-hour session in the workshop series:

Guided session
Hands-on exercise
Project work

Four sessions are planned in the series with the following time allocations:

Session	Guided session (min)	Hands-on exercise (min)	Total time (min)
1	50	70	120
2	30	90	120
3	40	80	120
4	90	30	120

Topics covered in the guided sessions and hands-on exercises:

Session 1: Exploratory Data Analysis and Feature Engineering using Pandas - 1

Pandas dataframes as the data structure for datasets
Converting csv files to dataframes
Slicing and indexing dataframes using conditionals as well as the iloc and loc indexers
Statistical summary and exploration of dataframes
Detecting and filling missing values in the dataframes
Regular expressions for data extraction
Feature engineering such as creating new features
Basic statistical plots using matplotlib and seaborn
Correlation among features
Basic operations such as dropping rows/columns, setting index, replacing values of a column using a dictionary, etc.
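
As a rough illustration of the Session 1 topics, here is a minimal sketch in pandas. The tiny inline dataset and its column names (`Name`, `Age`, `Fare`, `Sex`) are hypothetical stand-ins for a Kaggle CSV file, not the data used in the workshop notebooks, and plotting is left out to keep the sketch short.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a CSV file downloaded from Kaggle.
csv_text = """Name,Age,Fare,Sex
"Braund, Mr. Owen",22,7.25,male
"Cumings, Mrs. John",38,71.28,female
"Heikkinen, Miss. Laina",,7.93,female
"Allen, Mr. William",35,8.05,male
"""

# Converting CSV data to a dataframe (a file path works the same way).
df = pd.read_csv(io.StringIO(csv_text))

# Slicing and indexing: by position, by label, and with a conditional.
first_row = df.iloc[0]
ages = df.loc[:, "Age"]
women = df[df["Sex"] == "female"]

# Statistical summary of the numeric columns.
summary = df.describe()

# Detecting and filling missing values (here with the median age).
n_missing = df["Age"].isnull().sum()
df["Age"] = df["Age"].fillna(df["Age"].median())

# Regular expression to extract the title that follows the comma in each name.
df["Title"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)

# Replacing the values of a column using a dictionary.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Feature engineering: a new feature derived from an existing one.
df["IsAdult"] = (df["Age"] >= 18).astype(int)

# Correlation among numeric features.
corr = df[["Age", "Fare", "Sex"]].corr()

# Basic operations: dropping a column and setting the index.
df = df.drop(columns=["IsAdult"]).set_index("Name")
```

With the index set to `Name`, a single value can then be looked up by label, for example `df.loc["Allen, Mr. William", "Fare"]`.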

Session 2: Exploratory Data Analysis and Feature Engineering using Pandas - 2

Split-apply-combine operations by grouping rows of a dataframe
Encoding categorical variables
Concatenating and merging dataframes
More operations such as sorting the rows, creating a dataframe from scratch, etc.
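
The Session 2 operations can be sketched along the following lines. The `sales` and `cities` dataframes and all of their values are made up for illustration and do not come from the workshop notebooks.

```python
import pandas as pd

# Creating a dataframe from scratch (hypothetical data).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "item": ["pen", "ink", "pen", "ink"],
    "units": [10, 4, 7, 9],
})

# Split-apply-combine: group rows by store, then sum the units per group.
units_per_store = sales.groupby("store")["units"].sum()

# Encoding a categorical variable as one-hot (dummy) columns.
encoded = pd.get_dummies(sales, columns=["item"])

# Concatenating: stack the rows of a second dataframe below the first.
more_sales = pd.DataFrame({"store": ["C"], "item": ["pen"], "units": [3]})
all_sales = pd.concat([sales, more_sales], ignore_index=True)

# Merging: attach a city to each row by matching on the store column.
cities = pd.DataFrame({
    "store": ["A", "B", "C"],
    "city": ["Claremont", "Pomona", "Upland"],
})
merged = all_sales.merge(cities, on="store", how="left")

# Sorting the rows by units, largest first.
ranked = merged.sort_values("units", ascending=False)
```

A left merge keeps every row of `all_sales` even when a store has no match in `cities`; in that case the `city` value would be missing rather than the row being dropped.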

Session 3: Model Building, Tuning and Validation using Scikit-learn - 1

Overfitting and underfitting of models
Regression algorithms
- Linear Regression
- Polynomial Regression
- Ridge Regression
- Lasso Regression
Model Validation
Tuning the regularization parameter
Evaluation metrics for regression - R-squared and Root Mean-Squared Error (RMSE)
Normalization and scaling of features
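
To make the Session 3 topics concrete, the sketch below fits the listed regression models with scikit-learn. It assumes synthetic quadratic data generated with NumPy instead of the Bike Share Demand dataset, and the polynomial degrees and candidate `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption): a noisy quadratic function of one feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.5, size=200)

# Hold out a validation set to detect underfitting and overfitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(model):
    """Fit on the training set; return (R-squared, RMSE) on the validation set."""
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    return r2_score(y_val, pred), rmse


def poly_pipeline(degree, regressor):
    """Polynomial features, then feature scaling, then the given regressor."""
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        regressor,
    )


# A straight line underfits the quadratic data; degree 2 matches its shape.
linear_r2, linear_rmse = evaluate(LinearRegression())
poly_r2, poly_rmse = evaluate(poly_pipeline(2, LinearRegression()))

# Tuning the regularization parameter: keep the alpha with the lowest RMSE.
ridge_rmse = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge_rmse[alpha] = evaluate(poly_pipeline(5, Ridge(alpha=alpha)))[1]
best_alpha = min(ridge_rmse, key=ridge_rmse.get)

# Lasso regression penalizes the absolute size of the coefficients instead.
lasso_r2, lasso_rmse = evaluate(poly_pipeline(5, Lasso(alpha=0.1)))
```

Scaling is placed inside the pipeline so that the scaler is fitted on the training data only, which keeps information from the validation set out of the model.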

Session 4: Model Building, Tuning and Validation using Scikit-learn - 2

Classification algorithms
- Logistic Regression
- Decision Trees
- k-Nearest Neighbors
- Support Vector Machines
- Random Forests
Evaluation metrics for classification
- Classification accuracy
- Confusion matrix
- Decision Threshold
- Precision and Recall
- F1 score
- Area Under ROC curve
Dimensionality reduction (Optional)
- Principal Component Analysis (PCA)
k-fold Cross-validation
Maximum Voting Classifiers
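
The Session 4 topics can be sketched in scikit-learn as follows. This example assumes the breast cancer dataset bundled with scikit-learn instead of the Titanic data, uses default or arbitrary hyperparameters, and interprets "maximum voting" as hard (majority) voting; none of these choices is taken from the workshop notebooks.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Bundled binary classification dataset (an assumption, not the Titanic data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The five classifiers from the outline; scale-sensitive ones get a scaler.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# k-fold cross-validation (k=5): mean accuracy for each model.
cv_accuracy = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in models.items()
}

# Hard voting: each model casts one vote and the majority class wins.
voter = VotingClassifier(estimators=list(models.items()), voting="hard")
voter.fit(X_train, y_train)
vote_accuracy = accuracy_score(y_test, voter.predict(X_test))

# Decision threshold: turn predicted probabilities into class labels.
logistic = models["logistic"].fit(X_train, y_train)
proba = logistic.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

# Evaluation metrics for classification.
matrix = confusion_matrix(y_test, pred)
scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}

# Optional: reduce the 30 features to 5 principal components before fitting.
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=1000)
)
pca_accuracy = pca_model.fit(X_train, y_train).score(X_test, y_test)
```

Raising the threshold above 0.5 generally trades recall for precision, which is why the area under the ROC curve is computed from the probabilities and not from the thresholded labels.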

Pre-requisites:

Python programming basics (HMC CS-5 or equivalent should suffice)
Some familiarity with common statistical concepts (HMC MATH-35 or equivalent should suffice)

Learning materials:

The learning material is shared in the GitHub repository. You can download the entire repository and run the notebooks on your own system by installing Jupyter Notebook through the Anaconda distribution for Python 3. Alternatively, you can fork the notebooks from the following links and run them in Kaggle Kernels, a cloud computing environment that does not require any installation.

Session 1
Session 2
- Guided Session 2
- Exercise 2
Session 3
- Guided Session 3
- Regression algorithms
- Exercise 3 (Bike Share Demand competition)
Session 4
- Classification algorithms
- Miscellaneous concepts in Machine Learning
- Exercise 4 (Titanic competition)

The solutions for the guided sessions and exercise notebooks are available in the GitHub repository but not on Kaggle. The material is designed to be self-contained, so it remains useful if you miss a session.

Team:

Instructor: Aashita Kesarwani
TAs: Rex Asabor, Ben Langton and Qualan Woodard

Seats are limited; please register using this link. Attending all four sessions is important for the series to be useful.
