alibaba / clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

GPU sharing

LalchandPandia opened this issue

In Section 5.1, the paper discusses GPU sharing:

  1. Is GPU sharing both time- and space-multiplexed? The footnote says that AntMan is only implemented in TensorFlow and PyTorch. So is it the case that only GPU memory is shared among task instances (workers), or are SMs also shared among instances executing TensorFlow- or PyTorch-based code?
  2. What data is used for Fig. 8? Is it "pai_job_duration_estimate_100K.csv"? How do you model the no-GPU-sharing case? Are requests between (0, 1) changed to 1?

Thanks for your interest in our paper!

  1. Yes, basic GPU sharing, with both time and space multiplexing, is provided regardless of the DL framework. For controlled, fine-grained sharing, e.g., limiting each job to at most 50% of GPU time for running its kernels (Fig. 8 in AntMan), framework-level support is one possible approach.

  2. It is not derived from "pai_job_duration_estimate_100K.csv" but from the original dataset. Cell In [10] of the Jupyter notebook linked in the README produces the "w/ GPU sharing" figures. Yes, "without sharing" means that requests in the interval (0, 1) are changed to 1.
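
For reference, a minimal sketch of that transformation in pandas, assuming a task-level trace file and the `plan_gpu` column (the file name is hypothetical, and the notebook's actual cell may differ; `plan_gpu` appears to be expressed as a percentage of one GPU, e.g., 50 = 0.5 GPU):

```python
import numpy as np
import pandas as pd

# Sketch of the "w/o GPU sharing" setting: fractional GPU requests are
# rounded up to one whole GPU; whole-GPU requests stay unchanged.
tasks = pd.read_csv("pai_task_table.csv")  # hypothetical input file

gpu_req = tasks["plan_gpu"] / 100.0                   # percent -> GPUs
no_sharing = np.where((gpu_req > 0) & (gpu_req < 1),  # requests in (0, 1)
                      1.0,                            # ... become 1 GPU
                      gpu_req)                        # others unchanged
tasks["plan_gpu_no_sharing"] = no_sharing * 100.0     # back to percent
```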

Thanks for the prompt clarification!
Cell In [10] of the Jupyter notebook in the README gives the hourly plot of the number of tasks requested.
For Fig. 8, when I use gpu_wrk_util, the values come out much lower.
Should I use plan_gpu instead? But wouldn't that give the number of GPUs requested rather than allocated?
Which field corresponds to the number of GPUs allocated?

Regarding the number of GPUs allocated, we suggest using plan_gpu.
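
For illustration, a rough sketch of an hourly allocated-GPU curve built from plan_gpu (the file name is hypothetical; the `start_time`, `end_time`, and `plan_gpu` column names are assumed from the trace schema, with timestamps in seconds and `plan_gpu` as a percentage of one GPU; the notebook's exact logic may differ):

```python
import pandas as pd

tasks = pd.read_csv("pai_task_table.csv")  # hypothetical input file
tasks = tasks.dropna(subset=["start_time", "end_time", "plan_gpu"])

start = int(tasks["start_time"].min())
end = int(tasks["end_time"].max())

# At each hourly tick, sum plan_gpu over the tasks running at that moment.
ticks = range(start, end, 3600)
allocated = [
    tasks.loc[
        (tasks["start_time"] <= t) & (tasks["end_time"] > t), "plan_gpu"
    ].sum() / 100.0
    for t in ticks
]
hourly_gpus = pd.Series(allocated, index=list(ticks))
print(hourly_gpus.describe())
```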

Thanks!