R + H2O Training Pipeline and REST API
Objective of this project
The main point of this repository is to raise a REST API completely from scratch using H2O and the R language.
Thus, the objective of this repository is to empower developers in R and other data scientists to raise this type of service in their infrastructure and allow the use of R and H2O in production. This project has some limitations that can be seen below.
Objects and Folders
The project is organized as follows:
src
: Main script for model training and model serializationapi
: Files regarding the endpoint and prediction functionsdata
: Raw data for traininglogs
: Stores the training log and API logsmodels
: Main folder where the serialized models are stored
Bank Layman Brothers Loan API
The scenario where will be to build a REST API that will receive some pieces of information about some customers and will perform the probability of a consumer enter in default situation with their loan.
The data is stored in Github and has the following fields:
ID
: Customer IDLIMIT_BAL
: Balance limit for the customerSEX
: Customer genderEDUCATION
: Customer educationMARRIAGE
: Informs if the customer is married or notAGE
: Customer age in the moment of loanPAY_0 ... PAY_6
: Values of payments for the loan (principal)BILL_AMT1...BILL_AMT6
: Bill amountPAY_AMT1...PAY_AMT6
: Amount paidDEFAULT
: Informs if the customer entered in default or not
Before execution
Due to some limitations of R language to establishing the file paths (absolute/relative paths) I included all paths directly in the code (not ideal).
Said that those paths should be changed for your local directory before run. The files that need to be changed are:
src/gbm-layman-brothers.r
- line: 11api/endpoint.r
- line: 7api/api.r
- line: 10
Training Pipeline
There's some alternatives to run a R
command from terminal. The most elegant way to run a R script in a bash call it's the following one that I unshamelessly took from the Stack Overflow.
After the paths being changed, to execute the training just run the following line in terminal:
$ R < /<<YOUR-PATH>>/r-prediction-rest-api/src/gbm-layman-brothers.R --no-save
This script will run a training pipeline using AutoML in H2O, generate the logs in logs
folder, pick the best model and save it (serialize it) in models
folder.
Start the REST API from command line
Unfortunately due to the fact that H2O.ai do not have an option to set a name directly in AutoML models in the moment of serialization (model_id
field) you will need to take the name of the file and change it in the api/api.r
file in the line 21
and put the name of your file.
After your model being serialized and change this line, you should run the following command in terminal:<
$ R < /<<YOUR-PATH>>/r-prediction-rest-api/api/endpoint.R --no-save
API Swagger for testing
After your API just started, you can click in the following link and test the Swagger API with the values for each field:
http://127.0.0.1:8000/__swagger__/
API request via curl
Default FALSE:
$ curl -X POST "http://127.0.0.1:8000/prediction?PAY_AMT6=0&PAY_AMT5=0&PAY_AMT4=0&PAY_AMT3=0&PAY_AMT2=689&PAY_AMT1=0&BILL_AMT6=0&BILL_AMT5=0&BILL_AMT4=0&BILL_AMT3=689&BILL_AMT2=3102&BILL_AMT1=3913&PAY_6=-2&PAY_5=-2&PAY_4=-1&PAY_3=-1&PAY_2=2&PAY_0=2&AGE=24&MARRIAGE=1&EDUCATION=2&SEX=2&LIMIT_BAL=20000" -H "accept: application/json"
Default TRUE:
$ curl -X POST "http://127.0.0.1:8000/prediction?PAY_AMT6=5000&PAY_AMT5=1000&PAY_AMT4=1000&PAY_AMT3=1000&PAY_AMT2=1500&PAY_AMT1=1518&BILL_AMT6=15549&BILL_AMT5=14948&BILL_AMT4=14331&BILL_AMT3=13559&BILL_AMT2=14027&BILL_AMT1=29239&PAY_6=0&PAY_5=0&PAY_4=0&PAY_3=0&PAY_2=0&PAY_0=0&AGE=34&MARRIAGE=2&EDUCATION=2&SEX=2&LIMIT_BAL=90000" -H "accept: application/json"
See the logs
If you want to see the logs, just tail those logs using the following commands:
-
Training Pipeline:
$ training_pipeline_auto_ml.log
-
API Predictions:
$ tail -F api_predictions.log
Limitations
A more attentive developer will notice that he does not have some important limitations in the project such as: (1) considering sex as an independent variable, (2) logs outside a standard format, (3) no test cases, (4) fine tuning of the algorithm.
This happens due to the fact that the point here is to make a REST API from scratch using only R as the language, so here the point will be to start very simple and with limitations and it is up to the developer to adjust to his taste.
Another important point is that R is not a general programming language, at best R is a scientific programming language through scripting. This implies that the language has numerous inherent limitations by design.
TODO
- Error handling
- API security headers
- Enhance logging for predictions
- Have some bindings between IP and model requests
- Docker image