microsoft / superbenchmark

A validation and profiling tool for AI infrastructure

Home Page: https://aka.ms/superbench

Repository from GitHub: https://github.com/microsoft/superbenchmark

'sb deploy' is expected to exit with non-zero when failed

LiweiPeng opened this issue

Summary

This issue was found using the v0.6.0 release. On this system, Ansible was not set up properly because of a test environment issue. When I ran 'sb deploy -f local.ini -i superbench/superbench:v0.6.0-cuda11.1.1', it printed the error messages below. Although the log reports that the playbook run failed with return code 127, the sb program exited with 0.

[2022-09-09 18:49:10,573 N000000:30359][runner.py:43][INFO] Runner writes to: /home/aiscadmin/superbench/outputs/2022-09-09_18-49-10.
[2022-09-09 18:49:10,622 N000000:30359][runner.py:48][INFO] Runner will run: ['gpu-burn', 'nccl-bw:default', 'nccl-bw:gdr-only', 'ib-loopback', 'mem-bw', 'gpu-copy-bw:correctness', 'gpu-copy-bw:perf', 'kernel-launch', 'gemm-flops', 'cudnn-function', 'cublas-function', 'matmul', 'sharding-matmul', 'computation-communication-overlap', 'ort-inference', 'tensorrt-inference', 'gpt_models', 'bert_models', 'lstm_models', 'resnet_models', 'densenet_models', 'vgg_models']
[2022-09-09 18:49:10,622 N000000:30359][runner.py:165][INFO] Preparing SuperBench environment.
[2022-09-09 18:49:10,622 N000000:30359][ansible.py:125][INFO] Run playbook deploy.yaml ...
The command was not found or was not executable: ansible-playbook.
[2022-09-09 18:49:10,628 N000000:30359][ansible.py:80][WARNING] Run failed, return code 127.

$ echo $?
0
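
To illustrate the expected behavior, here is a minimal sketch of how a failed command's return code could be propagated to the process exit status. This is not SuperBench's actual code: `run_command` and `deploy` are hypothetical names, and mapping a missing executable to 127 is an assumption that mirrors the shell convention seen in the log above.

```python
import subprocess
import sys


def run_command(argv):
    """Run argv and return its exit code; return 127 if the executable is missing."""
    try:
        return subprocess.run(argv, check=False).returncode
    except FileNotFoundError:
        # Assumption: follow the shell convention for "command not found".
        print(f"The command was not found or was not executable: {argv[0]}.", file=sys.stderr)
        return 127


def deploy(argv):
    """Hypothetical deploy step: warn on failure and hand the code back to the caller."""
    rc = run_command(argv)
    if rc != 0:
        print(f"Run failed, return code {rc}.", file=sys.stderr)
    return rc
```

A CLI entry point would then call something like `sys.exit(deploy(["ansible-playbook", "deploy.yaml"]))`, so that `echo $?` reports 127 rather than 0 when ansible-playbook is missing.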

How to repro

Set up SuperBench normally. Before running 'sb deploy', remove the ~/.ansible directory. Then run 'sb deploy' as shown above.