alibaba / clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

GPU sharing

LalchandPandia opened this issue

In Section 5.1, the paper discusses GPU sharing:

  1. Is GPU sharing both time- and space-multiplexed? The footnote says that AntMan is only implemented in TensorFlow and PyTorch. So is it the case that only GPU memory is shared among task instances (workers), or are SMs also shared among instances executing TensorFlow- or PyTorch-based code?
  2. What data is used for Fig. 8? Is it "pai_job_duration_estimate_100K.csv"? How do you model the no-GPU-sharing case? Are requests between (0, 1) changed to 1?

Thanks for your interest in our paper!

  1. Yes, basic GPU sharing, with both time and space multiplexing, is provided regardless of the DL framework. For controlled, fine-grained sharing, e.g., limiting each job to at most 50% of GPU time for running its kernels (Fig. 8 in AntMan), framework-level support is one possible approach.

  2. It is not derived from "pai_job_duration_estimate_100K.csv" but from the original dataset. Cell In [10] of the Jupyter notebook linked in the README produces the "w/ GPU sharing" figures. Yes, "without sharing" means that requests in the interval (0, 1) are changed to 1.
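
For reference, a minimal sketch of that transformation in pandas, assuming a task-level trace file and the `plan_gpu` column (the file name is hypothetical, and the notebook's actual cell may differ; `plan_gpu` appears to be expressed as a percentage of one GPU, e.g., 50 = 0.5 GPU):

```python
import numpy as np
import pandas as pd

# Sketch of the "w/o GPU sharing" setting: fractional GPU requests are
# rounded up to one whole GPU; whole-GPU requests stay unchanged.
tasks = pd.read_csv("pai_task_table.csv")  # hypothetical input file

gpu_req = tasks["plan_gpu"] / 100.0                   # percent -> GPUs
no_sharing = np.where((gpu_req > 0) & (gpu_req < 1),  # requests in (0, 1)
                      1.0,                            # ... become 1 GPU
                      gpu_req)                        # others unchanged
tasks["plan_gpu_no_sharing"] = no_sharing * 100.0     # back to percent
```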

Thanks for the prompt clarification!
Cell In [10] of the Jupyter notebook in the README gives the hourly plot of the number of tasks requested.
For Fig. 8, when I use gpu_wrk_util, the values come out much lower.
Should I use plan_gpu instead? But wouldn't that give the number of GPUs requested rather than allocated?
Which field corresponds to the number of GPUs allocated?

Regarding the number of GPUs allocated, we suggest using plan_gpu.
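
For illustration, a rough sketch of an hourly allocated-GPU curve built from plan_gpu (the file name is hypothetical; the `start_time`, `end_time`, and `plan_gpu` column names are assumed from the trace schema, with timestamps in seconds and `plan_gpu` as a percentage of one GPU; the notebook's exact logic may differ):

```python
import pandas as pd

tasks = pd.read_csv("pai_task_table.csv")  # hypothetical input file
tasks = tasks.dropna(subset=["start_time", "end_time", "plan_gpu"])

start = int(tasks["start_time"].min())
end = int(tasks["end_time"].max())

# At each hourly tick, sum plan_gpu over the tasks running at that moment.
ticks = range(start, end, 3600)
allocated = [
    tasks.loc[
        (tasks["start_time"] <= t) & (tasks["end_time"] > t), "plan_gpu"
    ].sum() / 100.0
    for t in ticks
]
hourly_gpus = pd.Series(allocated, index=list(ticks))
print(hourly_gpus.describe())
```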

Thanks!