Why does MPS need 6 GPUs?
valiantljk opened this issue · comments
Hi,
In the Salus paper, regarding inference, it says:
Salus needs only 1 GPU, achieving 42× utilization improvement,
while the average latency overhead is less than 5ms.
For comparison, MPS needs 6 GPUs.
Could you explain why MPS needs 6 GPUs? What limitation on the GPU prevents it from running more instances of inference tasks?
For MPS, you need to ensure that the sum of all persistent memory (model parameters and framework-internal state) plus the sum of all ephemeral memory across jobs doesn't exceed the GPU memory capacity.
For Salus, the safety condition is relaxed: the sum of all persistent memory plus only the maximum of the ephemeral memory must not exceed the capacity.
It's explained in detail in section 3.3.2 in the paper.
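To make the difference concrete, here is a minimal sketch comparing the two safety conditions. The per-job memory figures and the GPU capacity are made-up numbers for illustration, not measurements from the paper.

```python
# Hypothetical GPU memory capacity in GB (illustrative only).
CAPACITY = 12.0

# (persistent, peak ephemeral) memory per inference job, in GB.
# Persistent = model weights + framework-internal state;
# ephemeral = temporary allocations within an iteration.
jobs = [(1.5, 2.0), (1.5, 2.0), (1.5, 2.0), (1.5, 2.0)]

# MPS: every job's full footprint must fit simultaneously,
# so both persistent and ephemeral memory are summed.
mps_demand = sum(p + e for p, e in jobs)

# Salus: persistent memory for all jobs is summed, but ephemeral
# memory is time-shared across jobs, so only the largest single
# ephemeral allocation counts.
salus_demand = sum(p for p, _ in jobs) + max(e for _, e in jobs)

print(f"MPS demand:   {mps_demand} GB (fits: {mps_demand <= CAPACITY})")
print(f"Salus demand: {salus_demand} GB (fits: {salus_demand <= CAPACITY})")
```

With these numbers, MPS would need 14 GB (so the four jobs cannot share one 12 GB GPU), while Salus needs only 8 GB and can pack all four onto a single GPU. The same relaxation, applied at the paper's scale, is why Salus packs the inference workload onto 1 GPU where MPS needs 6.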