rh-aiservices-bu / peer-review

Notebooks, models and workflows for easier experiment replication

Peer-review

A structured project with data, notebooks, models and a workflow that allows scientists to consistently replicate experiments. A typical data science experiment includes exploring data, building features (feature engineering), training multiple models in order to compare and select one, optimizing the selected model through hyperparameter tuning, and finally running predictions on new data (model inference). The notebooks are named and numbered so the order of the workflow is easy to follow.

Project Organization

README.md
LICENSE
config                          <- store path, data, model and run related configurations
run.py                          <- custom script to run notebooks via command line

data                            <- folder for training and test dataset
   └── input                    <- raw input data
   └── train-test               <- transformed input data for training and testing
   └── inference                <- data for inference

models                          <- folder to store ml models
   └── trained                  <- subfolder to store trained ml models
   └── tuned                    <- subfolder to store tuned ml models

notebooks                       <- folder for various notebooks
   └── 01-explore-data.ipynb    <- explore and visualize data
   └── 02-build-features.ipynb  <- clean and transform data for training and testing
   └── 03-train-model.ipynb     <- train multiple models and compare them
   └── 04-tune-model.ipynb      <- optimize ml model through hyperparameter tuning
   └── 05-model-inference.ipynb <- run prediction on new data

experiments                     <- folder to store artifacts from various runs
   └── experiment_0             <- folder to store results for each run
       └── data
           └── plots            <- contains plots from explore data notebook
           └── train-test       <- train and test data from build features notebook
       └── models
           └── trained          <- trained models from train model notebook
           └── tuned            <- optimized model from tune model notebook
       └── notebooks            <- folder to save executed notebooks
       └── prediction results   <- prediction results from model inference notebook
       

prediction.py                   <- the predict function called from Flask
wsgi.py                         <- basic Flask application
flask_run.ipynb                 <- run Flask application on localhost
flask_test.ipynb                <- test Flask application on localhost
Dockerfile                      <- Docker configuration to create containers

requirements.txt                <- List of all required packages
lib                             <- library routines to read configuration details

.gitignore                      <- standard python gitignore
.s2i                            <- hidden folder for advanced s2i configuration
   └── environment              <- s2i environment settings
gunicorn_config.py              <- configuration for gunicorn when run in OpenShift

Steps

  1. Log into the OpenShift cluster and click on the Red Hat Applications icon (the Rubik's-cube-like icon). Under OpenShift Managed Services, click Red Hat OpenShift Data Science. If you are trying out OpenShift Data Science in the sandbox from https://developers.redhat.com/products/red-hat-openshift-data-science/overview, skip to Step 2.

image

  2. You will see the following screen open in a separate tab.

image

If you are using the sandbox, you will see the following:

image

  3. Click on 'Launch Application' in the JupyterHub tile. On the following page, select 'TensorFlow' as the Notebook Image and 'Small' as the Container Size, then click 'Start Server' as shown below.

image

For the sandbox, you will see:

image

The rest of the screenshots are the same for a full deployment and the sandbox. Once you click 'Start Server', you should see the following screens. If there is an error starting the server, click 'Expand Event log' and refer to the log messages. Now we are ready to move on to the setup.

image

image

image

Setup

  1. The first step in the setup is to clone the repo into JupyterLab. There are multiple ways to do this: from the 'Git' drop-down menu, select 'Clone a Repository'; use 'Git' on the left panel and click the 'Clone a Repository' button (screenshot shown below); or use the terminal and clone the repository with 'git clone https://github.com/rh-aiservices-bu/peer-review.git'.

image

Once cloned, you should see the contents of the peer-review repo as shown below.

image

  2. Scroll down the 'Launcher' panel on the right to find 'Terminal' under 'Other'. Click on the Terminal and run the following commands: 'cd peer-review' and 'pip install -r requirements.txt'.

image

Experiment replication using Red Hat OpenShift Data Science

This repository is structured to serve as guidance for packaging data, notebooks, models, etc., so that it is easier to replicate the experiments or workflow (the combination of notebooks, models and results) once published.

Review/Rerun published notebooks

  1. The first step in the workflow is to configure the path, model, data, and run information in the config file. Prior to each replication or run, users need to update the version number under the [Run] section of the config. This records the results of that run in its own versioned folder under the experiments folder. For example, version 1 of the run would be recorded under experiments/experiment_1.

image

Note that the other parameters in the config file can be left as is for now. To start with, we are rerunning the notebooks with the default data, i.e., the data used by the author when publishing the experiment. In later steps, we will show how users can point a run to custom data.
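
As a quick illustration of how the version ties to the output folder, here is a minimal sketch using Python's configparser. It is not the repository's own lib code, and the exact section and key names ([Run], version) are assumptions based on the description above.

    import configparser
    from pathlib import Path

    # Read the INI-style config at the repo root (section/key names assumed:
    # [Run] with a "version" entry, as described above).
    config = configparser.ConfigParser()
    config.read("config")

    version = config["Run"]["version"]                      # e.g. "1"
    experiment_dir = Path("experiments") / f"experiment_{version}"

    # Artifacts for this run end up under experiments/experiment_<version>.
    print(f"Results of this run will be recorded under {experiment_dir}")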

  2. Users can review as well as rerun the notebooks in the typical order of the data science workflow. Either execute all the cells in a notebook (using 'Run all cells' from the 'Run' drop-down menu) or step through and execute each cell using the play button ('Run the selected cells and advance').

image

image

  3. Repeat Step 2 for all 5 notebooks. The following artifacts are stored at the end of each notebook:

        experiments
        └── experiment_1             <- assuming the user sets version in config to 1
            └── data
                └── plots            <- 01-explore-data.ipynb
                └── train-test       <- 02-build-features.ipynb
            └── models
                └── trained          <- 03-train-model.ipynb
                └── tuned            <- 04-tune-model.ipynb
            └── prediction results   <- 05-model-inference.ipynb

For example, after executing all the notebooks, you will see the following files populated under experiments/experiment_1. This is helpful when performing multiple runs of the experiment and comparing results against the published version.

image
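
If you prefer to verify the run outputs programmatically instead of browsing the file panel, a small sketch like the one below (an illustration only, not part of the repository) lists every artifact written under a given experiment folder.

    from pathlib import Path

    # List all artifacts produced by a run (plots, train-test data, models,
    # prediction results), relative to the experiment folder.
    experiment_dir = Path("experiments/experiment_1")

    for path in sorted(experiment_dir.rglob("*")):
        if path.is_file():
            print(path.relative_to(experiment_dir))

Running the same listing against another experiment folder makes it easy to compare two runs side by side.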

Rerun published notebooks with custom data

Replicating the experiments with custom data is straightforward: upload your new data and update the config to point to it. For example, assume the user uploads my_input.csv to the data/input folder. Update the config file (the value of 'input' in the [Data] section and the version for the current run) as shown below.

image
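
The same two edits can also be scripted rather than made by hand. The sketch below is only an illustration: the key names ('input' under [Data], 'version' under [Run]) follow the description above, but their exact spelling in the repository's config is an assumption.

    import configparser

    # Point the run at the custom data and start a new versioned run.
    config = configparser.ConfigParser()
    config.read("config")

    config["Data"]["input"] = "my_input.csv"   # file uploaded to data/input
    config["Run"]["version"] = "2"             # results go to experiments/experiment_2

    # Note: configparser drops comments when rewriting the file.
    with open("config", "w") as config_file:
        config.write(config_file)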

Then rerun all the notebooks. You should see that the results are stored in the experiments/experiment_2 folder.

image

Now you can compare your results against the published version (or your rerun of the published version). For example, the screenshot below shows a comparison of the trained models from both runs.

image

Running notebooks from the command line

The script 'run.py' can be used to execute the notebooks from the command line. The script provides options to execute the notebooks individually or all at once; use the up or down arrow keys to make your selection.

image
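
Under the hood, a command-line runner of this kind boils down to executing the notebooks in order and saving the executed copies into the current experiment folder. The sketch below illustrates the idea with papermill; this is an assumption for illustration, not necessarily the library or structure used by the repository's run.py, which also adds the interactive selection menu shown above.

    import configparser
    from pathlib import Path

    import papermill as pm   # assumed here for illustration

    # Resolve the current run's output folder from the config version.
    config = configparser.ConfigParser()
    config.read("config")
    version = config["Run"]["version"]
    out_dir = Path("experiments") / f"experiment_{version}" / "notebooks"
    out_dir.mkdir(parents=True, exist_ok=True)

    # Execute the notebooks in workflow order and keep the executed copies.
    notebooks = [
        "01-explore-data.ipynb",
        "02-build-features.ipynb",
        "03-train-model.ipynb",
        "04-tune-model.ipynb",
        "05-model-inference.ipynb",
    ]
    for nb in notebooks:
        print(f"Running {nb} ...")
        pm.execute_notebook(str(Path("notebooks") / nb), str(out_dir / nb))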

A sample execution is shown below. First, edit the config file to record a new version of the run.

image

Execute 'run.py' and select the option 'all'. The notebooks are executed in the order defined in the script.

image

After run.py completes, the results can be found under the experiments/experiment_3 folder.

image

Serving the model in OpenShift

To serve the model in OpenShift, you need to:

 1. Wrap the model inference code into a prediction function           <- prediction.py
 2. Build a Flask application to serve the prediction function         <- wsgi.py
     └── Run and test the Flask application locally before deploying   <- flask_run.ipynb and flask_test.ipynb
         it to OpenShift
 3. Create a Dockerfile to containerize everything - Flask application, <- Dockerfile
    prediction code, model and dependencies
 4. Deploy the container to OpenShift and run prediction via a REST API

All of the above are provided with the repository. Some things to note:

  • The prediction function is simply the 05-model-inference.ipynb notebook rewritten as a function.
  • The Flask application creates a web service that passes the user input to the prediction function and returns the result (a minimal sketch of this shape follows this list).
  • The Dockerfile copies all the necessary files, including the saved model, and contains the commands to install all required packages.
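
For orientation only, a stripped-down version of that Flask layer could look like the sketch below. The /predictions route matches the curl example later in this README; the body of the predict function and the port are placeholders, not the repository's actual prediction.py or wsgi.py.

    from flask import Flask, jsonify, request

    application = Flask(__name__)

    def predict(features):
        # Placeholder for the real prediction.py logic: load the tuned model
        # and run inference on the incoming vitals and lab values.
        return "No Sepsis"

    @application.route("/predictions", methods=["POST"])
    def predictions():
        data = request.get_json(force=True)   # JSON payload as in the curl example
        return jsonify({"prediction": predict(data)})

    if __name__ == "__main__":
        application.run(host="0.0.0.0", port=8080)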

Deploying container in OpenShift

  1. On the Red Hat OpenShift Data Science page, click the Rubik's-cube-like icon to list the Red Hat Applications, then click OpenShift console.

image

  2. Once you are inside the OpenShift console, click on 'Topology'. Right-click on the right panel to bring up 'Add to Project', then select 'Import from Git'.

image

  3. Enter the Git repo URL https://github.com/rh-aiservices-bu/peer-review.git and click 'Create'.

image

image

  4. Wait for the build to complete and the pod to be ready. Once the pod is ready, you will see a route available to access the application.

image

  5. Test the application using the curl command from flask_test.ipynb, but use the route copied from above.

image

image

image

Or paste the curl command in the terminal to get a prediction for the sample data:

[1002700000@jupyterhub-nb-panbalag experiments]$ curl -X POST -H "Content-Type: application/json" --data '{"HR": 103, "O2Sat": 90, "Temp": null, "SBP": null, "MAP": null, "DBP": null, "Resp": 30, "EtCO2": null, "BaseExcess": 21, "HCO3": 45, "FiO2": null, "pH": 7.37, "PaCO2": 90, "SaO2": 91, "AST": 16, "BUN": 14, "Alkalinephos": 98, "Calcium": 9.3, "Chloride": 85, "Creatinine": 0.7, "Glucose": 193, "Lactate": null, "Magnesium": 2, "Phosphate": 3.3, "Potassium": 3.8, "Bilirubin_total": 0.3, "Hct": 37.2, "Hgb": 12.5, "PTT": null, "WBC": 5.7, "Fibrinogen": null, "Platelets": 317}' http://peer-review-panbalag-dev.apps.rhods-sb-prod.3sox.p1.openshiftapps.com/predictions
{
  "prediction": "No Sepsis"
} 
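
The same check can be done from Python, for example with the requests library (a sketch; replace the URL with the route created for your own deployment).

    import requests

    # Same sample payload as the curl command above (null becomes None).
    payload = {
        "HR": 103, "O2Sat": 90, "Temp": None, "SBP": None, "MAP": None,
        "DBP": None, "Resp": 30, "EtCO2": None, "BaseExcess": 21, "HCO3": 45,
        "FiO2": None, "pH": 7.37, "PaCO2": 90, "SaO2": 91, "AST": 16,
        "BUN": 14, "Alkalinephos": 98, "Calcium": 9.3, "Chloride": 85,
        "Creatinine": 0.7, "Glucose": 193, "Lactate": None, "Magnesium": 2,
        "Phosphate": 3.3, "Potassium": 3.8, "Bilirubin_total": 0.3,
        "Hct": 37.2, "Hgb": 12.5, "PTT": None, "WBC": 5.7,
        "Fibrinogen": None, "Platelets": 317,
    }

    url = "http://<your-route>/predictions"   # route from the Topology view

    response = requests.post(url, json=payload)
    print(response.json())                    # e.g. {'prediction': 'No Sepsis'}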


License

GNU General Public License v3.0

