nutanix / nai-dl-bench

ML workflow validation package

MPIRun setup:

Install MPIRun on each node:

sudo apt install openmpi-bin

Enable passwordless SSH login from the master node to all other nodes. Choose any one of the participating nodes to be the master node.
Run ssh-keygen on the master node. Then run the following command on the master node for each worker node, assuming ubuntu is the username on all VMs:

ssh-copy-id -i /home/ubuntu/.ssh/id_rsa <username>@<worker node ip>

  • The dataset has to be accessible from all nodes, e.g. over NFS
  • The absolute paths of the dataset and the training code need to be the same across all nodes
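
Before launching a multi-node run it can help to confirm that key-based login really works from the master node. The following is a minimal sketch, not part of this repo: the helper names are hypothetical, and it assumes the standard OpenSSH client options BatchMode and ConnectTimeout are available.

```python
import subprocess
from typing import List


def ssh_check_command(user: str, host: str) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never fall back to an interactive prompt
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        f"{user}@{host}",
        "true",                     # no-op remote command; only the exit code matters
    ]


def passwordless_ssh_works(user: str, host: str) -> bool:
    """Return True if key-based login to the given node succeeds."""
    result = subprocess.run(ssh_check_command(user, host), capture_output=True)
    return result.returncode == 0
```

Calling passwordless_ssh_works("ubuntu", "<worker node ip>") for each worker should return True once ssh-copy-id has been run for that worker.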

Training Run

To start multi node training, use the following command


-n Number of training processes
-h Comma separated list of Host IPs
-m IP Address of master node
-c Python command with space separated list of --<option> <argument> inside double quotes for the training script. data and output options are mandatory for the script
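
The options above can be assembled programmatically. This is an illustrative sketch only: build_training_command and its launcher argument are hypothetical names (pass the path of the launcher script from this repo), and the checks encode two assumptions taken from this README: the master is one of the participating hosts, and the data and output options appear as --data-folder and --output-folder, as in the examples further down.

```python
import shlex
from typing import List


def build_training_command(launcher: str, num_processes: int, hosts: List[str],
                           master: str, python_command: str) -> List[str]:
    """Assemble the argument list for the multi-node launcher script."""
    if master not in hosts:
        # Assumption: the master is one of the participating nodes.
        raise ValueError("master node must be one of the hosts")
    tokens = shlex.split(python_command)
    for required in ("--data-folder", "--output-folder"):
        # The data and output options are mandatory for the training script.
        if required not in tokens:
            raise ValueError(f"missing mandatory option {required}")
    return [
        "bash", launcher,
        "-n", str(num_processes),   # number of training processes
        "-h", ",".join(hosts),      # comma separated list of host IPs
        "-m", master,               # IP address of the master node
        "-c", python_command,       # passed as one argument, like the quoted string
    ]
```

The returned list can be handed to subprocess.run on the master node; keeping the Python command as a single list element mirrors the double quotes used on the command line.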

In the following examples a node with ip is chosen as the master node.

Single-node training command using 1 GPU:

bash training/code/ -n 1 -h -m  -c "python3 --data-folder /home/ubuntu/data --output-folder /home/ubuntu/output --model resnet50 --output-model-file resnet.pth --batch-size 12 --workers 1 --pf 2 --num-epochs 1"

Two-node training command using 1 GPU on each node; run the command on the master node:

bash training/code/ -n 2 -h, -m   -c "python3 --data-folder /home/ubuntu/data --output-folder /home/ubuntu/output --model resnet50 --output-model-file resnet.pth --batch-size 12 --workers 1 --pf 2 --num-epochs 1"


Setup TorchServe

Install openjdk, pip3

sudo apt-get install openjdk-17-jdk python3-pip

Nvidia driver installation: Reference:

Clone this repo and change to the torchserve folder.

Install TS libraries

cd inference/code/torchserve
pip install -r requirements.txt

Create .mar file for resnet50

Generate new using the eager mode

python --model_name resnet50 --weight ResNet50_Weights.DEFAULT

Generate resnet50.mar file

torch-model-archiver --model-name resnet50 --version 1.0 --model-file models/resnet50/ --serialized-file --handler image_classifier --extra-files index_to_name.json

Create a folder and move the .mar file inside it

mkdir model_store
mv resnet50.mar model_store/resnet50.mar

Start Torchserve Server

Torchserve Start command
torchserve --start --ncs --model-store model_store --ts-config --log-config log4j2.xml
Health Check
curl http://localhost:8080/ping
Register model

curl -X POST "http://{inference_endpoint}:{management_port}/models?url={model_location}&initial_workers={number}&synchronous=true&batch_size={number}&max_batch_delay={delay_in_ms}"

curl -X POST  "http://localhost:8081/models?url=resnet50.mar&initial_workers=1&synchronous=true&batch_size=1&max_batch_delay=20"
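
The same registration call can be made from Python. The sketch below only uses the standard library; the function names are hypothetical, and the default values simply mirror the curl example above rather than any defaults defined by TorchServe.

```python
import urllib.request
from urllib.parse import urlencode


def registration_url(host: str, management_port: int, model_location: str,
                     initial_workers: int = 1, batch_size: int = 1,
                     max_batch_delay: int = 20) -> str:
    """Build the management API URL used to register a model."""
    query = urlencode({
        "url": model_location,
        "initial_workers": initial_workers,
        "synchronous": "true",
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,   # delay in milliseconds
    })
    return f"http://{host}:{management_port}/models?{query}"


def register_model(url: str) -> str:
    """Send the POST request and return the server's response body."""
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

For example, register_model(registration_url("localhost", 8081, "resnet50.mar")) sends the same request as the second curl command, provided the server is running.
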
Describe registered model

GET /models/{model_name}

curl http://localhost:8081/models/resnet50
Edit config for a registered model
curl -v -X PUT "http://localhost:8081/models/resnet50?min_worker=3&max_worker=6"
Inference Check

curl http://{inference_endpoint}/predictions/{model_name} -T {input_file}

Test input file can be found in data folder

curl http://localhost:8080/predictions/resnet50 -T input.jpg
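
For scripted checks, the upload done by curl -T can be reproduced with the Python standard library. This is a sketch under assumptions: the function names are hypothetical, PUT is used because that is what curl -T issues over HTTP, and the application/octet-stream content type is set explicitly so that the image bytes are not labeled as form data.

```python
import urllib.request


def prediction_request(inference_endpoint: str, model_name: str,
                       payload: bytes) -> urllib.request.Request:
    """Build a PUT request that uploads raw input bytes for prediction."""
    url = f"http://{inference_endpoint}/predictions/{model_name}"
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def predict(inference_endpoint: str, model_name: str, input_path: str) -> str:
    """Read the input file, send it to the server and return the response body."""
    with open(input_path, "rb") as handle:
        payload = handle.read()
    request = prediction_request(inference_endpoint, model_name, payload)
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

With the server running, predict("localhost:8080", "resnet50", "input.jpg") corresponds to the curl command above.
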
Unregister a model

DELETE /models/{model_name}/{version}

curl -X DELETE http://localhost:8081/models/resnet50/1.0
Torchserve Stop command
torchserve --stop

For a more detailed explanation of the management endpoint, check out -

By default, TorchServe uses all available GPUs for inference. Use number_of_gpu in the file to limit the usage of GPUs
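
As an illustration of limiting GPU usage, the helper below updates or appends a key=value line in the text of the configuration file passed with --ts-config. It is a sketch, not code from this repo: set_property is a hypothetical name, and it assumes the file uses plain key=value lines with # comments.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return the config text with key set to value, appending it if absent."""
    lines = text.splitlines()
    updated = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # skip comments and lines that are not key=value pairs
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[index] = f"{key}={value}"
            updated = True
    if not updated:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

Calling set_property(config_text, "number_of_gpu", "1") and writing the result back to the file would restrict TorchServe to one GPU on the next start.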

Properties in file can be updated as required Reference:

Log level can be set as required by modifying the log4j2.xml file

Automated Setup and Inference run

You can test your trained model end to end by running the inference run script with the required arguments. The script is inside the inference/code/torchserve folder.

command to run inference


-n Name of the Model
-d Absolute path to the inputs folder that contains data to be predicted.
-m Absolute path to the saved model file
-f Absolute path to the model arch file
-c Absolute path to the class mapping file
-h Absolute path to the handler file
-e Comma separated absolute paths of all the additional files required by the model
-g Number of GPUs to use. Defaults to 0, in which case the CPU is used
-a Absolute path to the model archive file (.mar)
-k Keep the TorchServe server alive after the run completes. If not set, the server is stopped

Inference run should print "Inference Run Successful" as a message at the end.
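
When the inference run is wrapped in automation, the success message can be checked on the captured output. The following is a minimal sketch; run_succeeded is a hypothetical helper, and it assumes "at the end" means the last non-empty line of the output.

```python
SUCCESS_MESSAGE = "Inference Run Successful"


def run_succeeded(output: str) -> bool:
    """Return True if the last non-empty output line contains the success message."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(lines) and SUCCESS_MESSAGE in lines[-1]
```

The output string would typically come from subprocess.run(..., capture_output=True, text=True).stdout after invoking the inference run script.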

Inference run using default models

  • Run Inference on the existing standard resnet50/densenet161/fasterrcnn_resnet50_fpn model provided in this repo. Set the name parameter as required.
bash inference/code/torchserve/ -n resnet50
bash inference/code/torchserve/ -n fasterrcnn_resnet50_fpn
  • Run inference on a data folder. The path should contain only files that are valid inputs for inference.
bash inference/code/torchserve/ -n resnet50 -d inference/data
  • Run Inference on the trained resnet50 model that was generated using the training code provided in this repo.
bash inference/code/torchserve/ -n resnet50 -d /home/ubuntu/data -m

Inference run using custom trained models

  • Run Inference on the custom model of your choice. Make sure to set all the parameters as shown in the example
bash inference/code/torchserve/ -n resnet50 -d /home/ubuntu/data -m /home/ubuntu/model/ -f /home/ubuntu/model/ -c /home/ubuntu/index_to_name.json -h image_classifier -e /home/gavrishdemo/test/ -g 2
  • Custom trained model can be added as a default option
  • Create a folder inside "models" folder with the name of the model and add all the required files
        -   // custom saved model can be stored in any location. Provide absolute path during cmd execution
        - class_map.json
  • make an entry for this custom model in "models/models.json"
    "custom100": {
        "model_arch_file": "",
        "handler": "",
        "class_map": "class_map.json"
    }
  • run command
bash inference/code/torchserve/ -n custom100 -d /home/ubuntu/data -m models/custom100/
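
The models.json entry can also be added with a short script. This sketch assumes that models.json is a single JSON object keyed by model name, as the fragment above suggests; add_model_entry and the file names in the usage comment are hypothetical.

```python
import json


def add_model_entry(models: dict, name: str, model_arch_file: str,
                    handler: str, class_map: str) -> dict:
    """Return a copy of the models mapping with an entry for a custom model."""
    updated = dict(models)
    updated[name] = {
        "model_arch_file": model_arch_file,
        "handler": handler,
        "class_map": class_map,
    }
    return updated


# Usage (hypothetical file names, for illustration only):
#   with open("models/models.json") as handle:
#       models = json.load(handle)
#   models = add_model_entry(models, "custom100", "model.py", "handler.py", "class_map.json")
#   with open("models/models.json", "w") as handle:
#       json.dump(models, handle, indent=4)
```

Returning a new dictionary instead of mutating the input keeps the original mapping available if the write to disk fails.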

Inference run using pre-existing MAR files

  • Run inference using custom created mar files directly
bash inference/code/torchserve/ -a /home/ubuntu/custom50.mar -d inference/data

Fine tune params for better performance

  • Default parameters can be overridden to get better performance out of the registered model
  • make an entry for the model in "models/models.json"
    "custom200": {
        "initial_workers": "4",
        "batch_size": "16",
        "max_batch_delay": "400",
        "response_timeout": "2000"
    }
  • make sure to pass the key of this entry as the model name with "-n" in the command
bash inference/code/torchserve/ -n custom200 -a /home/ubuntu/custom200.mar


bash inference/code/torchserve/ -n custom200 -d /home/ubuntu/data -m models/custom200/
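
Because the tuning values are written as strings in models.json, a quick sanity check before a run can catch typos. This is an illustrative sketch: parse_tuning is a hypothetical helper, and rejecting negative values is an assumption rather than a rule documented by this repo.

```python
TUNING_KEYS = ("initial_workers", "batch_size", "max_batch_delay", "response_timeout")


def parse_tuning(entry: dict) -> dict:
    """Convert the tuning values of a models.json entry to integers."""
    parsed = {}
    for key in TUNING_KEYS:
        if key not in entry:
            continue  # keys that are absent keep the script's defaults
        value = int(entry[key])  # raises ValueError for non-numeric strings
        if value < 0:
            raise ValueError(f"{key} must not be negative")
        parsed[key] = value
    return parsed
```

Applied to the custom200 entry above, the function returns the four values as integers; any other keys in the entry are ignored.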

