jf20541 / LogisticRegressionPyTorch

Predict binary values using logistic regression, implemented both with PyTorch and from scratch. Optuna is used to compare candidate models (SVM, Decision Tree, Logistic Regression) and to search for optimal hyper-parameters.

LogisticRegression

Objective

Logistic Regression, implemented both with PyTorch and from scratch, to determine the impact of multiple independent variables presented simultaneously on a binary target [1: RainTomorrow, 0: No RainTomorrow]. Because the original model performed poorly (78.15%), multiple ML models (Support Vector Machines, Decision Trees, and Logistic Regression) were tested, with each model's hyper-parameters optimized.

Output

Optimal Model for Dataset

Optimal Model: Logistic Regression 

Trial 2 finished with value: 86.99%
Parameters: 'penalty': l2, 'logistic-regularization': 9.377701438670483

Logistic Regression using PyTorch

Logistic Regression using PyTorch Accuracy: 78.15%
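A minimal sketch of a logistic-regression model in PyTorch. The toy data, learning rate, and training loop here are illustrative assumptions, not the contents of `pytorchmodel.py`:

```python
import torch
import torch.nn as nn

class LogisticRegressionModel(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)  # one logit per sample

    def forward(self, x):
        # sigmoid maps the logit to a probability in [0, 1]
        return torch.sigmoid(self.linear(x))

# toy data: 100 samples, 12 features (matching the feature count below)
torch.manual_seed(0)
X = torch.randn(100, 12)
y = (X[:, 0] > 0).float().unsqueeze(1)

model = LogisticRegressionModel(n_features=12)
criterion = nn.BCELoss()  # binary cross-entropy, the negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

accuracy = ((model(X) > 0.5).float() == y).float().mean().item()
```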

Logistic Regression from scratch

Mean Logistic Regression Accuracy: 77.89%
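A from-scratch version amounts to gradient descent on the negative log-likelihood using plain NumPy. This is a sketch under assumed data and hyper-parameters, not the code in `models.py`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the mean negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) / len(y)  # gradient of the loss w.r.t. beta
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # bias column + 2 features
y = (X[:, 1] + X[:, 2] > 0).astype(float)

beta = fit_logistic(X, y)
preds = (sigmoid(X @ beta) > 0.5).astype(float)
accuracy = (preds == y).mean()
```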

Repository File Structure

├── src          
│   ├── pytorchmodel.py      # Logistic Regression using PyTorch and its evaluation metric
│   ├── optimal_model.py     # Compares Logistic Regression, SVM, and Decision Tree to find the optimal model
│   ├── models.py            # Logistic Regression from scratch
│   ├── train.py             # Initializes the model, the evaluation metric, and the argument parser
│   ├── create_folds.py      # Creates the cross-validation folds
│   ├── data.py              # Data cleaning and feature engineering
│   └── config.py            # Defines paths as global variables
├── inputs
│   ├── train.csv            # Training dataset
│   └── train_folds.csv      # K-Fold dataset
├── models                   # Saving/Loading models parameters
│   ├── LR_fold0.bin
│   ├── LR_fold1.bin
│   ├── LR_fold2.bin 
│   ├── LR_fold3.bin 
│   └── LR_fold4.bin
├── requierments.txt         # Packages used for project
├── sources.txt              # Sources
└── README.md

Model

Logistic regression is a supervised-learning method for binary classification. It applies the sigmoid function (σ) to a linear combination of the features, mapping each prediction to a conditional probability in [0, 1]. "Logistic" refers to the log-odds model: the log of the ratio of the probability that an event occurs to the probability that it does not, given in the equation below.


log(p(x) / (1 − p(x))) = β₀ + β₁x₁ + ⋯ + βₙxₙ
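The inverse relationship between the log-odds and the sigmoid can be checked numerically (a small illustration, not project code):

```python
import math

def log_odds(p):
    # the logit: log of the ratio of the event probability to its complement
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
z = log_odds(p)          # log(0.8 / 0.2) = log(4)
recovered = sigmoid(z)   # sigmoid inverts the log-odds, recovering p
```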

Metric & Mathematics

It uses Maximum Likelihood Estimation (MLE) to find the optimal parameters. For labels [0, 1], it estimates the coefficients such that the product of the conditional probabilities over all samples is maximized: probabilities for class-1 samples are pushed toward 1, and probabilities for class-0 samples toward 0.

Combine the products by taking the log-likelihood, which converts the product into a summation.

Substitute p(x_i) with its exponential form, group the coefficients of y_i, and simplify; the β coefficients that maximize this function are the optimal parameters.
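Written out, the steps above correspond to the standard textbook derivation (a sketch of the usual form, not copied from the repository):

```latex
% likelihood over all samples, labels y_i in {0, 1}
L(\beta) = \prod_i p(x_i)^{y_i} \, \bigl(1 - p(x_i)\bigr)^{1 - y_i}

% log-likelihood: the product becomes a summation
\ell(\beta) = \sum_i \Bigl[ y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr]

% substitute p(x_i) = e^{\beta \cdot x_i} / (1 + e^{\beta \cdot x_i}) and group the y_i terms
\ell(\beta) = \sum_i \Bigl[ y_i \, (\beta \cdot x_i) - \log\bigl(1 + e^{\beta \cdot x_i}\bigr) \Bigr]
```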

Data Features and Target

Kaggle's Weather Data

Target
RainTomorrow    float64

Features
MinTemp         float64
MaxTemp         float64
Rainfall        float64
Humidity9am     float64
Humidity3pm     float64
Pressure9am     float64
Pressure3pm     float64
Temp9am         float64
Temp3pm         float64
RainToday       float64
Year              int64
Month             int64
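Preparing the target and the date-derived features listed above might look like the following. The column names match the table, but the two-row frame and the exact encoding are assumptions; the real cleaning steps live in `data.py`:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-06-24", "2017-06-25"],
    "MinTemp": [3.5, 4.1],
    "RainToday": ["No", "Yes"],
    "RainTomorrow": ["Yes", "No"],
})

# encode the binary columns as floats: 1.0 = rain, 0.0 = no rain
for col in ("RainToday", "RainTomorrow"):
    df[col] = df[col].map({"Yes": 1.0, "No": 0.0})

# derive the Year and Month integer features from the raw date
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df = df.drop(columns="Date")
```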

About

License: MIT License


Languages

Python 98.7%, Shell 1.3%