
Machine-Learning-Projects

A Repo for my AI, Data Science and Machine Learning Projects

Content

  1. California Housing Prices Predictor
  2. MNIST Digits Classifier (Rite of Passage)
  3. Titanic Survival Predictor
  4. Spam Message Classifier
  5. Description of some Machine Learning Algorithms

Project 1: California Housing Prices Predictor

Link to Dataset

Link to Notebook

This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short. This model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

Steps

  1. Look at the big picture.
  2. Get the data.
  3. Discover and visualize the data to gain insights.
  4. Prepare the data for Machine Learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

I followed all these steps in the Jupyter notebook and clearly explained the insights gained at each stage.

Insights Obtained

California Housing Prices Advanced plot

I tried several models for this predictor but settled on a Random Forest Regressor, selecting the 15 best features and the best hyperparameters using a randomized search (RandomizedSearchCV). I pickled the best model, along with the best SVM I trained.
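As a rough illustration of that search (not the notebook's exact code), here is a minimal sketch using scikit-learn's built-in California housing data as a stand-in for the notebook's prepared features:

```python
import joblib
from scipy.stats import randint
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Built-in data as a stand-in for the notebook's prepared feature matrix
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Sample hyperparameters at random instead of trying every combination
param_distribs = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(4, 20),
}
rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=10,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

# Rank features by importance (with a richer feature set, keep the top 15)
importances = rnd_search.best_estimator_.feature_importances_
print(sorted(zip(importances, housing.feature_names), reverse=True))

# Pickle the winning model for later reuse
joblib.dump(rnd_search.best_estimator_, "best_forest_model.pkl")
```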

Project 2: MNIST Digits Classifier

This project uses the MNIST dataset, a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labelled with the digit it represents. The dataset is already split into a well-stratified training set of 60,000 images and a test set of 10,000 images.

The dataset was fetched using Scikit-Learn's OpenML helper: fetch_openml("mnist_784", version=1).
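For reference, the fetch and the standard split can be sketched as:

```python
from sklearn.datasets import fetch_openml

# Download MNIST from OpenML: 70,000 images of 784 pixels (28x28) each
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]

# The dataset comes pre-split: first 60,000 for training, last 10,000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```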

Some Digits from the Dataset

Link to Notebook

I first built different types of classifiers:

  1. A binary classifier for the digit 5
  2. Multiclass classifiers (using both OvA and OvO) with SVM's SVC (which resorts to OvO by default) and an SGD classifier
  3. A multilabel classifier using a KNN classifier
  4. A multioutput-multiclass classifier to remove noise from a noisy digit image
  5. A multiclass classifier for the MNIST dataset with over 97% accuracy, using a KNN classifier and GridSearchCV to select the best hyperparameters
  6. Data augmentation on the MNIST dataset (added shifted images for every digit), followed by training another KNN classifier; this gave the highest accuracy (see the sketch below)
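The augmentation step in item 6 can be sketched as follows, assuming the X_train and y_train arrays from the fetch above; scipy's shift fills vacated pixels with zeros:

```python
import numpy as np
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    """Shift a flattened 28x28 MNIST image by (dx, dy) pixels."""
    shifted = shift(image.reshape((28, 28)), [dy, dx], cval=0)
    return shifted.reshape([-1])

# Add four shifted copies (left, right, up, down) of every training image
X_augmented = [image for image in X_train]
y_augmented = [label for label in y_train]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_augmented.append(shift_image(image, dx, dy))
        y_augmented.append(label)

X_train_augmented = np.array(X_augmented)
y_train_augmented = np.array(y_augmented)
```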

I evaluated these classifiers using cross-validation, confusion matrices, the precision/recall trade-off (precision score, recall score, and decision thresholds), the ROC curve and ROC AUC score, and error-analysis methods after plotting the confusion matrix.
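A condensed sketch of that evaluation workflow for the binary 5-detector, again assuming X_train and y_train from above:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             precision_score, recall_score, roc_auc_score,
                             roc_curve)
from sklearn.model_selection import cross_val_predict

# Binary target: is the digit a 5?
y_train_5 = (y_train == "5")

# Cross-validated decision scores keep the evaluation honest
sgd_clf = SGDClassifier(random_state=42)
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
y_pred = (y_scores > 0)

print(confusion_matrix(y_train_5, y_pred))
print(precision_score(y_train_5, y_pred), recall_score(y_train_5, y_pred))

# Curves behind the plots shown below
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
fpr, tpr, roc_thresholds = roc_curve(y_train_5, y_scores)
print(roc_auc_score(y_train_5, y_scores))
```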

Some Plots

Precision vs Recall

Recall vs Threshold

ROC curve

Error Analysis (the diagonal was zeroed out; brighter cells indicate the highest error rates)

Project 3: Titanic Survival Predictor

Link to Dataset

Problem Statement: The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked, and so on. This is a Kaggle competition, and the dataset has already been split into a training and a test set. The test set doesn't contain the labels.

Output:

A CSV file containing a predicted value of "True" or "False" for the "Survived" label of each passenger. This CSV can be submitted to Kaggle to get an accuracy score.

Methodology

  1. After peeking at the data, I discovered that Age, Cabin, and Embarked have some missing values. I dropped Cabin entirely and, after careful analysis, replaced the missing Age values with the median age.
  2. I preprocessed the numerical and categorical data separately, converting the categorical data into numerical form with a OneHotEncoder (a pipeline sketch follows this list).
  3. I then trained a Random Forest Classifier with 100 estimators and evaluated it, achieving an accuracy of 81.6% on the test set.
  4. Next, I tried a Support Vector Classifier and, after cross-validation, achieved 82.6% accuracy.
  5. I then engineered some features:
    • Converted numerical attributes to categorical attributes: for example, different age groups had very different survival rates, so I created an age-bucket category and used it instead of age.
    • Added a special category for people traveling alone, since only 30% of them survived.
    • Replaced SibSp and Parch with their sum.
  6. I then tuned my SVC hyperparameters with a grid search until I achieved an accuracy of 82.71%.
  7. I will try other models, evaluate my engineered features, and continue tuning hyperparameters.
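A minimal sketch of the preprocessing and training from steps 2 and 3, assuming the usual column names in Kaggle's train.csv (illustrative, not the notebook's exact code):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.read_csv("train.csv")  # from the Kaggle competition page

# Numerical and categorical columns handled by separate sub-pipelines
num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # e.g. missing ages
        ("scale", StandardScaler()),
    ]), num_attribs),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # e.g. missing Embarked
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_attribs),
])

X = preprocess.fit_transform(train)
y = train["Survived"]

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
print(cross_val_score(forest_clf, X, y, cv=10).mean())
```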

Project 4: Spam Message Classifier

(Almost complete; finishing touches remain.)

Methodology

  1. Obtained the dataset from this root directory; these are the specific links to the spam and ham emails.
  2. I parsed the emails and extracted the text from the raw HTML files using regular expressions. (For a more robust solution, you can use the BeautifulSoup library.)
  3. I carried out stemming using the Natural Language Toolkit (NLTK) module.
  4. I created transformers to extract extra features, for example EmailsToWordCounter and WordCounterToVector (see the sketch after this list). Word extraction works for many written languages, such as English. A vocabulary was built and word occurrences were counted against it.
  5. I transformed the entire dataset using a transformer pipeline consisting of EmailsToWordCounter and WordCounterToVector.
  6. I trained a Logistic Regression model and, upon evaluation on the training set, obtained an accuracy of 98.5%.
  7. Evaluation on the test set gives:
  • Precision: 96.88%
  • Recall: 97.89%
  8. I also explored using the BeautifulSoup library for better parsing stability.
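The two transformers named in steps 4 and 5 can be sketched as follows. This is illustrative code under my own assumptions, not the notebook's exact implementation, and it omits the stemming and email-parsing steps:

```python
import re
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin

class EmailsToWordCounter(BaseEstimator, TransformerMixin):
    """Turn each raw email text into a Counter of lowercased words."""
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return np.array([Counter(re.findall(r"[a-z0-9]+", text.lower()))
                         for text in X], dtype=object)

class WordCounterToVector(BaseEstimator, TransformerMixin):
    """Build a vocabulary of the most common words, then vectorize counts."""
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total = Counter()
        for counter in X:
            total.update(counter)
        most_common = total.most_common(self.vocabulary_size)
        self.vocabulary_ = {word: i + 1 for i, (word, _) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows, cols, data = [], [], []
        for row, counter in enumerate(X):
            for word, count in counter.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))  # column 0 = out-of-vocabulary
                data.append(count)
        return csr_matrix((data, (rows, cols)),
                          shape=(len(X), self.vocabulary_size + 1))
```

Chaining these two in a scikit-learn Pipeline yields the sparse matrix fed to the Logistic Regression model.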

Project 5: Description of some Machine Learning Algorithms

This notebook is dedicated to exploring and describing different Machine Learning algorithms. I will discuss the following here:

1. Linear Regression

  • SVD and/or the Normal Equation
  • Gradient Descent
  • Mini-batch Gradient Descent
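For the closed-form route, the Normal Equation computes the weights directly as theta = (XᵀX)⁻¹Xᵀy, while the SVD approach solves the same least-squares problem more stably. A minimal NumPy sketch on synthetic data (illustrative only):

```python
import numpy as np

# Synthetic linear data: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]  # add the bias feature x0 = 1

# Normal Equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best)  # roughly [[4.], [3.]]

# SVD-based least squares, the more stable route scikit-learn takes
theta_svd, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_svd)
```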

Linear Model Prediction Plot

Gradient Descent Paths Plot

2. Polynomial Regression (Using Linear models to fit non-linear data)

Polynomial Regression (Quadratic Plot demonstrated)

High Degree Polynomials Plot
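The core trick, expanding the features with polynomial terms and then fitting an ordinary linear model, can be sketched like this (synthetic quadratic data; illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data: y = 0.5x^2 + x + 2 + noise
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.normal(size=(100, 1))

# Expand features to include x^2, then fit an ordinary linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # roughly 2, [1, 0.5]
```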

3. Regularization

  • Ridge Regression
  • Lasso Regression
  • Elastic Net
  • Early Stopping (Very common in Gradient Descent)

Ridge Regression Plot

Lasso Regression Plot

Early Stopping Plot
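A minimal sketch of these regularized models on synthetic data (alpha values here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor

# Synthetic linear data
rng = np.random.default_rng(42)
X = 3 * rng.random((50, 1))
y = 1 + 0.5 * X.ravel() + rng.normal(scale=0.5, size=50)

# alpha controls regularization strength in all three models
ridge = Ridge(alpha=1.0).fit(X, y)                       # L2: shrinks weights
lasso = Lasso(alpha=0.1).fit(X, y)                       # L1: can zero weights out
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
for model in (ridge, lasso, elastic):
    print(type(model).__name__, model.intercept_, model.coef_)

# Early stopping: halt training once the validation score stops improving
sgd = SGDRegressor(early_stopping=True, validation_fraction=0.2,
                   n_iter_no_change=5, random_state=42).fit(X, y)
```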

4. Logistic Regression (Using Probabilities)

  • Softmax Regression: a method for applying Logistic Regression to multinomial/multiclass classification.

Logistic Regression Demonstration

Classifying Flower Species
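A minimal Softmax Regression sketch on the iris dataset (hyperparameters here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target       # three flower species

# multi_class="multinomial" requests Softmax Regression explicitly
# (recent scikit-learn versions already default to it for multiclass data)
softmax_reg = LogisticRegression(multi_class="multinomial", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))        # predicted species
print(softmax_reg.predict_proba([[5, 2]]))  # class probabilities
```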

(Work in Progress)

Home Page: https://uzoochogu.github.io/Machine-Learning-Projects/