Search for the best regression model (DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, SVR, KNeighborsRegressor) and tune each model's hyper-parameters with the Tree-structured Parzen Estimator (TPE) approach over 1,000 trials. Model performance is evaluated by RMSE under different feature engineering approaches (One-Hot Encoding, Target Encoding, etc.) on the house price prediction dataset.
RandomForestRegressor: TRAINING_CLEAN
Performance (RMSE): 118098.32337847917
Best hyperparameters: {'model_type': 'RandomForestRegressor', 'n_estimators': 618, 'max_depth': 12, 'min_samples_split': 19, 'min_samples_leaf': 2, 'max_features': 'auto'}
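For reference, the RMSE reported above can be computed as in the minimal sketch below. The function name `rmse` and the sample prices are hypothetical and only illustrate the metric; they are not taken from the project code or dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house prices, for illustration only.
actual = [300000.0, 450000.0, 520000.0]
predicted = [310000.0, 430000.0, 550000.0]
score = rmse(actual, predicted)
print(f"RMSE: {score:.2f}")
```

Because RMSE is expressed in the same unit as the target, a score of roughly 118,000 means the predictions are off by about that amount in price units on a typical row.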
├── src
│ ├── optimal_model.py # Extract the optimal regression model and its hyper-parameters for deployment
│ ├── train_no_scaling.py # Tune RandomForest, DecisionTree, and XGBRegressor hyper-parameters on unscaled data
│ ├── train_scaling.py # Tune KNearestNeighbor and SVM hyper-parameters on scaled data
│ └── config.py # Define paths as global variables
├── inputs
│ ├── train.csv # Training dataset from Kaggle
│ ├── train_no_scale.csv # Unscaled dataset (feature engineered, feature selected)
│ ├── train_scale.csv # Scaled dataset (feature engineered, feature selected)
│ └── train_clean.csv # Cleaned data, feature engineered, scaled
├── templates
│ └── home.html # HTML Code for front end deployment
├── statics
│ └── css
│     └── style.css # Styles applied to the HTML elements
├── notebooks
│ └── house_price_eda.ipynb # EDA, Feature Engineering, Feature Selection
├── requierments.txt # Packages used for project
├── sources.txt # Sources
├── Dockerfile # Dockerize Flask Application
└── README.md
docker build -t optimalregressionapi .
docker run -ti optimalregressionapi
source activate ml
# Train Optimal Regression Model with Optuna
cd src/src
python train_no_scaling.py
python train_scaling.py
python optimal_model.py
# Deploy Model using Flask
cd ..
python app.py
https://randomforestregressorhomeprediction.azurewebsites.net
Objective
Build a probabilistic model of the objective function and use it to select the regression model and hyper-parameters that minimize RMSE on the testing set. The Bayesian approach keeps track of past results and uses them to form a probabilistic model that maps hyper-parameters to the probability of a score on the objective function.
TPE applies Bayes' rule, p(y | x) = p(x | y) p(y) / p(x), where p(y | x) is the probability of an RMSE score (y) on the objective function given a set of hyper-parameters (x).
TPE models p(x | y) with two distributions over the hyper-parameters:
- l(x): density of the hyper-parameters whose objective value is less than the threshold
- g(x): density of the hyper-parameters whose objective value is greater than the threshold
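Written out, the rule and the two densities fit together as below. The threshold symbol y* is an assumption introduced here for notation; the text above only calls it "the threshold".

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)},
\qquad
p(x \mid y) =
\begin{cases}
l(x) & \text{if } y < y^{*} \\
g(x) & \text{if } y \ge y^{*}
\end{cases}
```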
Expected Improvement (EI) is maximized by maximizing the ratio l(x)/g(x). The objective function records each result together with its hyper-parameters, forming a history. TPE draws candidate hyper-parameters from l(x), evaluates the ratio l(x)/g(x) for each candidate, and returns the one with the highest Expected Improvement.
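A single TPE suggestion step for one numeric hyper-parameter might look like the sketch below. It is a simplified illustration, not Optuna's implementation: the Gaussian kernel density estimate, the fixed `bandwidth`, the `gamma` quantile used as the threshold, and the toy history are all assumptions made for the example.

```python
import math
import random


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=2.0, rng=None):
    """Suggest the next hyper-parameter value from a history of (x, rmse) pairs."""
    if len(history) < 2:
        raise ValueError("need at least two past trials")
    rng = rng or random.Random(0)
    ranked = sorted(history, key=lambda pair: pair[1])  # lowest RMSE first
    # The best gamma fraction of trials falls below the threshold.
    n_good = min(len(ranked) - 1, max(1, math.ceil(gamma * len(ranked))))
    good = [x for x, _ in ranked[:n_good]]  # RMSE below threshold, forms l(x)
    bad = [x for x, _ in ranked[n_good:]]   # RMSE above threshold, forms g(x)
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Sample candidates from l(x): pick a good point and add kernel noise.
    candidates = [
        rng.choice(good) + rng.gauss(0.0, bandwidth) for _ in range(n_candidates)
    ]
    # Return the candidate with the largest ratio l(x)/g(x).
    return max(candidates, key=lambda x: l(x) / max(g(x), 1e-12))


# Toy history: a hypothetical max_depth value and a made-up RMSE for each trial.
history = [(d, (d - 12) ** 2 + 100.0) for d in range(2, 31, 4)]
suggestion = tpe_suggest(history)
print(f"next max_depth to try: {suggestion:.2f}")
```

In the toy history the lowest scores sit around a depth of 10 to 14, so the suggestion lands in that region, where l(x) is high and g(x) is low.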
Target:
Price int64
Features:
Bedrooms int64
Bathrooms int64
Sqft_living int64
sqft_lot int64
floors int64
waterfront int64
view int64
condition int64
sqft_above int64
sqft_basement int64
yr_built int64
yr_renovated int64
city int64
statezip int64
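The One-Hot and Target Encoding approaches mentioned in the overview can be sketched for a categorical column such as `city` as follows. The helper names, city values, and prices are hypothetical; the project's actual encoding lives in the notebook and may differ.

```python
from collections import defaultdict


def one_hot(values):
    """Encode each value as a 0/1 row with one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def target_encode(values, targets):
    """Replace each category with the mean target over its rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(values, targets):
        totals[value] += target
        counts[value] += 1
    means = {value: totals[value] / counts[value] for value in totals}
    return [means[value] for value in values], means


# Hypothetical rows, for illustration only.
cities = ["Seattle", "Kent", "Seattle", "Kent"]
prices = [500000.0, 300000.0, 700000.0, 320000.0]

onehot_rows, onehot_columns = one_hot(cities)
encoded, city_means = target_encode(cities, prices)
print(onehot_columns, onehot_rows)
print(encoded)
```

Target encoding keeps a single numeric column per feature, which suits tree models on high-cardinality columns, but the means should be computed on the training split only so that test prices do not leak into the features.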