Machine learning to SQL
Background
SQL is the main language used for data manipulation and therefore has extensive support in terms of compute and scheduling, so why not perform inference with a machine learning model written in SQL? The big limitation is SQL itself, which is why this project uses machine learning models whose structure is simple enough to be written in SQL. An additional benefit is that the model is interpretable: if you can write a model down in a basic logical language such as SQL, you should be able to understand it (within limits, of course).
This project tries to make the process simple enough for any SQL user to train a model, check the performance and deploy that model in SQL.
Current state
- Only EBM (Explainable Boosting Machine) is implemented (decision tree, logistic regression and rule set not yet)
- Automated model training is working for binary classification and regression.
- SQL creation of the model is fully working for binary classification.
- SQL for regression and the whole process for multiclass classification are work in progress.
Prerequisites
- Create a virtual environment and install the packages. On macOS run:
python3 -m venv .ml2sql
source .ml2sql/bin/activate
pip install -r requirements.txt
How to use main script
- Save a csv file containing the target and all features in the input/data/ folder
- Save a settings json file in the input/configuration/ folder (explained below at Configuration json)
- In the terminal run:
bash run.sh
- Follow the instructions on screen
- The output will be saved in the folder trained_models/<current_date>_<your_model_name>/
- The .sql file will contain a SQL CASE WHEN statement imitating the decision tree/EBM
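To make the idea of a CASE WHEN statement imitating an EBM more concrete, here is a minimal Python sketch. It assumes a single numeric feature whose learned contribution is constant within each bin. The function name `feature_to_case_when`, the bin edges and the scores are made up for illustration and are not taken from this project's code or output.

```python
def feature_to_case_when(column, upper_edges, scores):
    """Render a piecewise-constant feature contribution as a SQL CASE WHEN.

    upper_edges: ascending upper bounds of all bins except the last one.
    scores: one score per bin, so len(scores) == len(upper_edges) + 1.
    """
    if len(scores) != len(upper_edges) + 1:
        raise ValueError("need exactly one more score than bin edges")
    lines = ["CASE"]
    for edge, score in zip(upper_edges, scores):
        lines.append(f"  WHEN {column} < {edge} THEN {score}")
    # Everything not caught by an earlier branch falls into the last bin.
    lines.append(f"  ELSE {scores[-1]}")
    lines.append(f"END AS {column}_score")
    return "\n".join(lines)


# Hypothetical bins and scores for an illustrative "age" feature.
sql = feature_to_case_when("age", [30, 50], [-0.4, 0.1, 0.6])
print(sql)
```

In this sketch a NULL value would silently end up in the ELSE branch, which is one reason to impute NULL values before scoring.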
Configuration json
features
List with the names of the columns which should be used as features (optional)
model_params
Dictionary of parameters that can be used with model of choice (optional). Check the model's documentation:
- EBM (model documentation)
- Decision tree (model documentation)
- Decision rule (model documentation)
post_params
calibration options (optional, not fully implemented):
- sigmoid, Platt scaling applied
- isotonic, isotonic regression applied
- auto/true, either Platt scaling or isotonic regression applied based on data size
- any other value, no calibration applied
sql_split options:
- false, outputs the SQL model as one column by adding all separate scores up directly
- true, outputs the SQL model as one column for each feature and a total score column afterwards. This might be needed to avoid some memory related (stack overflow) errors.
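The difference between the two sql_split layouts can be sketched as follows. This is an illustration under assumptions, not the SQL this project generates: `compose_score_query`, the table name and the per-feature expressions are hypothetical, and the model intercept is left out for brevity.

```python
def compose_score_query(feature_sql, table, split):
    """Combine per-feature score expressions into one SELECT statement.

    feature_sql maps a feature name to a SQL expression giving its score.
    """
    if split:
        # One column per feature in a subquery, summed in the outer query.
        inner_cols = ",\n    ".join(
            f"{expr} AS {name}_score" for name, expr in feature_sql.items()
        )
        total = " + ".join(f"{name}_score" for name in feature_sql)
        return (
            f"SELECT *, {total} AS total_score\n"
            f"FROM (\n  SELECT\n    {inner_cols}\n  FROM {table}\n) AS feature_scores"
        )
    # All feature scores added up directly in a single expression.
    total = " + ".join(f"({expr})" for expr in feature_sql.values())
    return f"SELECT {total} AS total_score\nFROM {table}"


# Hypothetical per-feature expressions.
features = {
    "age": "CASE WHEN age < 30 THEN -0.4 ELSE 0.6 END",
    "income": "CASE WHEN income < 2000 THEN -0.2 ELSE 0.3 END",
}
single_query = compose_score_query(features, "customers", split=False)
split_query = compose_score_query(features, "customers", split=True)
print(split_query)
```

With many features the single expression becomes one very long sum of CASE WHEN statements, while the split layout keeps each expression small and only sums column references at the end.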
file_type options (optional):
- png, output of feature importance graphs will be static .png (smaller file)
- html, output of feature importance graphs will be dynamic .html (bigger file, opens in browser)
pre_params
cv_type options (optional):
- timeseriesplit, perform 5 fold time series split (sklearn implementation)
- any other value, perform 5 fold stratified cross validation
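To illustrate what a time series split does differently from ordinary cross validation, here is a simplified expanding-window split in plain Python. It is a sketch of the idea only: the project relies on the sklearn implementation, whose exact fold boundaries may differ, and `expanding_window_splits` is a hypothetical helper. The rows are assumed to be sorted by the time sensitive column.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) pairs over time-ordered rows.

    Each fold trains on all rows before its test block, so no fold
    ever trains on rows that come after the rows it is tested on.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough rows for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))


folds = list(expanding_window_splits(n_samples=12, n_splits=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Stratified cross validation, by contrast, ignores row order and instead keeps the class proportions similar across folds.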
max_rows options (not used currently):
- Any positive whole number, will limit the data set in order to train faster
time_sensitive_column options (optional):
- Name of date column
  - used when cv_type = timeseriesplit
  - used when out-of-time dataset is created (not implemented yet)
- used when
upsampling options (optional, should not be used without calibration):
- true, applying the SMOTE(NC) algorithm on the minority class to balance the data
- false, not applying any resampling technique
target
Name of target column (required)
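Putting the keys above together, a settings file could be produced along these lines. This is a sketch under assumptions: the nesting of the options under pre_params and post_params is inferred from the layout of this section, the column names are placeholders, and writing true/false as JSON booleans is a guess, so check an example configuration shipped with the project for the exact types expected.

```python
import json

# Hypothetical settings; column names and values are placeholders.
settings = {
    "target": "churned",                      # required
    "features": ["age", "income", "tenure"],  # optional
    "model_params": {},                       # optional, passed to the chosen model
    "pre_params": {
        "cv_type": "timeseriesplit",
        "time_sensitive_column": "signup_date",
        "upsampling": False,
    },
    "post_params": {
        "calibration": "auto",
        "sql_split": False,
        "file_type": "png",
    },
}

# The resulting text would be saved as a .json file in input/configuration/.
settings_text = json.dumps(settings, indent=2)
print(settings_text)
```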
Notes
- Any NULL values should be imputed before using this script
- Data imbalance treatments (e.g. oversampling + model calibration) not fully implemented
- Resampling (almost) always makes the trained model ill calibrated
- Multiclass and regression are experimental
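Why resampling hurts calibration can be shown with a little arithmetic. The sketch below uses the standard odds rescaling for a changed class prior. It is not the calibration this project applies (that would be Platt scaling or isotonic regression), and it assumes that resampling only changes the class balance, which holds only approximately for SMOTE. The function name and the numbers are illustrative.

```python
def correct_for_resampling(p_resampled, original_rate, resampled_rate):
    """Map a probability from a model trained on resampled data back to
    the original class balance by rescaling the odds.

    All three arguments must lie strictly between 0 and 1.
    """
    odds = p_resampled / (1.0 - p_resampled)
    prior_ratio = (original_rate / (1.0 - original_rate)) / (
        resampled_rate / (1.0 - resampled_rate)
    )
    corrected_odds = odds * prior_ratio
    return corrected_odds / (1.0 + corrected_odds)


# Hypothetical numbers: 5% positives originally, balanced to 50% by upsampling.
p = correct_for_resampling(0.5, original_rate=0.05, resampled_rate=0.5)
print(round(p, 4))
```

In this example a score of 0.5 from the model trained on balanced data corresponds to a probability of about 0.05 on the original data, which is why uncalibrated scores after upsampling should not be read as probabilities.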
TODO list
- Add calibration (platt scaling/isotonic regression)
- Implement null handling (there is an implementation mentioned here)
- Make multi class classification EBM work fully
- Make regression EBM work fully
- Removing outliers by using quantiles (e.g. only keeping 1 - 99 % quantiles)
- Add decision tree
- Add logistic regression
- Add Skope rules
- Spatial Cross-validation discovery
- Extend logging granularity (add model parameters)
- Use a bash menu function for choosing the model type
- Add a check for a target with a single unique value
- Replace classification_report and confusion_matrix due to dependence on threshold
- Add MCC, Cohen's kappa and other metrics plotted against the threshold