jf20541 / Optimal-RegressionModel-HyperParameters-Flask-Azure-Docker

Search for the best regression model and its hyperparameters using the Tree-structured Parzen Estimator (TPE) approach, which applies Bayes' rule, and evaluate performance with RMSE.

Optimal-RegressionModel-HyperParameters-Flask-Azure

Objective

Compare five regression models (DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, SVR, KNeighborsRegressor) and optimize each model's hyper-parameters with the Tree-structured Parzen Estimator (TPE) approach over 1,000 trials. Model performance is evaluated by RMSE under different feature-engineering approaches (One-Hot Encoding, Target Encoding, etc.) on the house price prediction dataset.
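To make the two encodings named above concrete, here is a minimal standard-library sketch. The helper names `one_hot_encode` and `target_encode`, the city names, and the prices are all hypothetical illustrations; they are not taken from this repository or from the dataset.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Return (rows, categories): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def target_encode(values, targets):
    """Replace each category with the mean target observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values]


# Hypothetical sample rows (illustrative only).
cities = ["Seattle", "Kent", "Seattle", "Bellevue"]
prices = [500000, 300000, 700000, 900000]

one_hot, categories = one_hot_encode(cities)
target_encoded = target_encode(cities, prices)
print(categories)
print(one_hot)
print(target_encoded)
```

One-hot encoding adds one column per category, while target encoding keeps a single numeric column. In practice the target means are usually computed on training folds only, so that the encoded feature does not leak the label into validation data.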

Optimal Regression Model and Optimal Hyper-Parameters

RandomForest Regressor: TRAINING_CLEAN

Performance (RMSE): 118098.32337847917
Best hyperparameters: {'model_type': 'RandomForestRegressor', 'n_estimators': 618, 'max_depth': 12, 'min_samples_split': 19, 'min_samples_leaf': 2, 'max_features': 'auto'}
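For reference, RMSE is the square root of the mean squared difference between predicted and actual prices, so it is expressed in the same units as the target. A minimal sketch follows; the function name `rmse` and the sample prices are illustrative assumptions, not values from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices (illustrative only).
actual = [300000.0, 450000.0, 520000.0]
predicted = [310000.0, 430000.0, 500000.0]
print(rmse(actual, predicted))
```

Because errors are squared before averaging, a few large misses on expensive homes can dominate the score.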

Repository File Structure

├── src          
│   ├── optimal_model.py        # Extract the optimal regression model with its optimal hyper-parameters for deployment
│   ├── train_no_scaling.py     # Optimize RandomForest, DecisionTree, and XGBRegressor hyper-parameters on unscaled data
│   ├── train_scaling.py        # Optimize KNearestNeighbor and SVM hyper-parameters on scaled data
│   └── config.py               # Define paths as global variables
├── inputs
│   ├── train.csv               # Training dataset from Kaggle
│   ├── train_no_scale.csv      # Unscaled dataset (feature-engineered, after feature selection)
│   ├── train_scale.csv         # Scaled dataset (feature-engineered, after feature selection)
│   └── train_clean.csv         # Cleaned, feature-engineered, scaled data
├── templates
│   └── home.html               # HTML Code for front end deployment
├── statics
│   └── css
│       └── style.css           # Styles applied to the HTML elements
├── notebooks
│   └── house_price_eda.ipynb   # EDA, Feature Engineering, Feature Selection
├── requierments.txt            # Packages used for project
├── sources.txt                 # Sources
├── Dockerfile                  # Dockerize Flask Application 
└── README.md

Docker

 docker build -t optimalregressionapi .
 docker run -ti optimalregressionapi  
 source activate ml 
 
 # Train Optimal Regression Model with Optuna
 cd src/src
 python train_no_scaling.py
 python train_scaling.py
 python optimal_model.py
 
 # Deploy Model using Flask 
 cd .. 
 python app.py

Deployment with Flask/Azure

https://randomforestregressorhomeprediction.azurewebsites.net

Metric

Tree-structured Parzen Estimator Approach (TPE)

Objective
Build a probabilistic model of the objective function and use it to select the regression model and hyper-parameters that minimize RMSE on the test set. As a Bayesian approach, TPE keeps track of past trial results and uses them to form a probabilistic model that maps hyper-parameters to the probability of a score on the objective function.

TPE applies Bayes' rule, where p(y | x) is the probability of an RMSE score y on the objective function given a set of hyper-parameters x:

  p(y | x) = p(x | y) p(y) / p(x)

TPE models p(x | y) with two distributions over the hyper-parameters, split at a threshold y* on the objective value:

  1. l(x): density of the hyper-parameters whose objective value is less than the threshold (y < y*)
  2. g(x): density of the hyper-parameters whose objective value is greater than or equal to the threshold (y >= y*)

Maximizing Expected Improvement (EI) amounts to maximizing the ratio l(x)/g(x). The objective function records each result together with its hyper-parameters, forming a history. TPE splits that history at the threshold to form l(x) and g(x), draws candidate hyper-parameters from l(x), evaluates the ratio l(x)/g(x) for each candidate, and returns the candidate with the highest Expected Improvement.
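The l(x)/g(x) selection step can be sketched for a single numeric hyper-parameter using only the standard library. This is a simplified illustration under stated assumptions, not the implementation used by this repository or by Optuna's TPE sampler: the fixed kernel bandwidth, the `gamma` split fraction, the helper names, and the toy quadratic loss standing in for RMSE are all hypothetical choices.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Equal-weight mixture of Gaussians centred on the given points."""
    norm = bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return total / (norm * len(points))

    return density


def build_densities(history, gamma=0.25, bandwidth=1.0):
    """Split (x, loss) history at the gamma quantile and fit l(x) and g(x)."""
    if len(history) < 2:
        raise ValueError("need at least two trials to form l(x) and g(x)")
    ordered = sorted(history, key=lambda pair: pair[1])  # lower loss is better
    n_good = min(len(ordered) - 1, max(1, math.ceil(gamma * len(ordered))))
    good = [x for x, _ in ordered[:n_good]]
    bad = [x for x, _ in ordered[n_good:]]
    return gaussian_kde(good, bandwidth), gaussian_kde(bad, bandwidth), good


def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=1.0, rng=random):
    """Sample candidates from l(x) and return the one maximizing l(x)/g(x)."""
    l, g, good = build_densities(history, gamma, bandwidth)
    # Sampling from the mixture l(x): pick a good point, add Gaussian noise.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    # The small constant guards against division by zero when g(x) underflows.
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


def toy_loss(x):
    """Hypothetical objective standing in for RMSE; minimized at x = 3."""
    return (x - 3.0) ** 2


rng = random.Random(0)
history = [(x, toy_loss(x)) for x in (rng.uniform(-10, 10) for _ in range(10))]
for _ in range(30):
    x_next = tpe_suggest(history, rng=rng)
    history.append((x_next, toy_loss(x_next)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(f"best x = {best_x:.3f}, loss = {best_loss:.4f}")
```

Each iteration appends the new (hyper-parameter, loss) pair to the history, so l(x) and g(x) are refit on every trial. Production samplers additionally handle categorical and conditional parameters (such as the choice of model type) and adapt the kernel bandwidths.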

Data

Kaggle Dataset

Target  
Price                  int64

Features: 
Bedrooms               int64
Bathrooms              int64
Sqft_living            int64
sqft_lot               int64
floors                 int64
waterfront             int64
view                   int64
condition              int64
sqft_above             int64
sqft_basement          int64
yr_built               int64
yr_renovated           int64
city                   int64
statezip               int64

License: MIT License

Languages

Jupyter Notebook 98.6%, Python 0.8%, CSS 0.5%, HTML 0.1%, Dockerfile 0.0%