SymbioticLab / Salus

Fine-grained GPU sharing primitives

error when running the benchmark

lethean1 opened this issue

Hi, I ran into a problem running your benchmark.
I did the following to install Salus.
First, I started the server with docker run --rm -it registry.gitlab.com/salus/salus and got the following output:

[2022-12-19 08:59:16.965984] [1] [default] [I] Running build type: Debug
[2022-12-19 08:59:16.966143] [1] [default] [I] Verbose logging level: 0 file: verbose.log
[2022-12-19 08:59:16.966157] [1] [default] [I] Performance logging: disabled file: verbose.log
[2022-12-19 08:59:16.966168] [1] [default] [I] Allocation logging: enabled
[2022-12-19 08:59:16.966177] [1] [default] [I] Scheduling parameters:
[2022-12-19 08:59:16.966187] [1] [default] [I]     Policy: pack
[2022-12-19 08:59:16.966197] [1] [default] [I]     MaxQueueHeadWaiting: 50
[2022-12-19 08:59:16.966206] [1] [default] [I]     WorkConservative: on
[2022-12-19 08:59:16.966333] [41] [default] [I] TaskExecutor scheduling thread started
[2022-12-19 08:59:16.966463] [42] [default] [I] ExecutionEngine scheduling thread started
[2022-12-19 08:59:16.966914] [1] [default] [I] Starting server listening at tcp://*:5501

Then I ran the benchmark in the same Docker container:

pip3 install -r requirements.txt
python3 -m benchmarks.driver card308

And I got this error:

root@349ebb6dcb74:~/Salus# python3 -m benchmarks.driver card308
[2022-12-19 13:06:40,304] [cli] [INFO] Running experiment: benchmarks.exps.card308
[2022-12-19 13:06:40,304] [cli] [INFO] Saving log files to: /root/Salus/scripts/templogs/card308
[2022-12-19 13:06:40,307] [benchmarks.exps.card308] [INFO] **** Saving SavedModel: vgg11eval_1
[2022-12-19 13:06:40,307] [benchmarks.exps.card308] [INFO] **** Location: /root/Salus/scripts/templogs/card308
[2022-12-19 13:06:40,308] [benchmarks.driver.utils.utils] [INFO] Using temporary directory: /dev/shm/tmpu8dbzl_q
[2022-12-19 13:06:40,308] [benchmarks.driver.workload] [INFO] Starting workload `vgg11eval_1' on TF with output file: /dev/shm/tmpu8dbzl_q/vgg11eval_1.tf.1iter.0.output
[2022-12-19 13:06:40,308] [benchmarks.driver.runner] [INFO] Starting workload with cmd: ['stdbuf', '-o0', '-e0', '--', 'python', 'tf_cnn_benchmarks.py', '--display_every=1', '--num_gpus=1', '--variable_update=parameter_server', '--nodistortions', '--executor=tf', '--num_batches=1', '--batch_size=1', '--model_dir=/symbiotic/peifeng/tf_cnn_benchmarks_models/legacy_checkpoint_models/vgg11', '--model=vgg11', '--eval_block=true', '--eval', '--saved_model_dir=/symbiotic/peifeng/tf_cnn_benchmarks_models/saved_models/vgg11']
[2022-12-19 13:06:40,312] [benchmarks.exps] [INFO] Waiting all workloads to finish
[2022-12-19 13:06:40,339] [benchmarks.driver.server] [INFO] Workload vgg11eval_1 exited with 1
Press enter to continue...
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/Salus/benchmarks/driver/__main__.py", line 212, in <module>
    sys.exit(main())
  File "/root/Salus/benchmarks/driver/__main__.py", line 198, in main
    expm.main(argv)
  File "/root/Salus/benchmarks/exps/card308.py", line 64, in main
    run_tf(FLAGS.save_dir, wl)
  File "/root/Salus/benchmarks/exps/__init__.py", line 124, in run_tf
    raise RuntimeError(f'Workload {w.canonical_name} did not finish cleanly: {w.proc.returncode}')
RuntimeError: Workload vgg11eval_1 did not finish cleanly: 1
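
One thing that may be worth checking, though it is not confirmed as the cause: the logged workload command passes --model_dir and --saved_model_dir paths under /symbiotic/peifeng/, and those may not exist inside the container. The sketch below is a minimal Python check under that assumption. The helper name missing_path_flags is hypothetical and not part of Salus; the two arguments are copied from the "Starting workload with cmd" log line above.

```python
import os

# Path-carrying arguments copied from the "Starting workload with cmd" log line.
cmd = [
    '--model_dir=/symbiotic/peifeng/tf_cnn_benchmarks_models/legacy_checkpoint_models/vgg11',
    '--saved_model_dir=/symbiotic/peifeng/tf_cnn_benchmarks_models/saved_models/vgg11',
]


def missing_path_flags(args):
    """Return {flag: path} for each --*_dir=PATH argument whose PATH does not exist.

    Hypothetical helper for diagnosis only; it is not part of Salus.
    """
    missing = {}
    for arg in args:
        # Skip anything that is not of the form --name=value.
        if not arg.startswith('--') or '=' not in arg:
            continue
        flag, value = arg[2:].split('=', 1)
        # Assumption: flags ending in "_dir" carry filesystem paths.
        if flag.endswith('_dir') and not os.path.exists(value):
            missing[flag] = value
    return missing


if __name__ == '__main__':
    for flag, path in missing_path_flags(cmd).items():
        print(f'--{flag} points at a missing path: {path}')
```

If this reports missing paths, the exit code 1 could simply come from tf_cnn_benchmarks.py failing to find its model files. The workload output file named in the log (/dev/shm/tmpu8dbzl_q/vgg11eval_1.tf.1iter.0.output) should contain the actual error message if it still exists.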

Any help solving this problem is appreciated!