How to find the GPU memory usage pattern in TensorFlow or PyTorch?
Xuyuanjia2014 opened this issue · comments
I have read your paper Fine-Grained GPU Sharing Primitives for Deep Learning Applications (MLSys 2020) and other work on deep learning scheduling, including Gandiva (OSDI 2018) and Tiresias (NSDI 2019).
Because of TensorFlow's and PyTorch's caching allocators, when I use the nvidia-smi command to measure a DL job's GPU memory usage, it always reports 100%.
Are there any tools or methods I can use to get a characterization similar to the ones in Salus or Gandiva, maybe the TensorFlow profiler?
Hi,

nvidia-smi indeed doesn't work in this case because of the memory pooling.
Can't say for Gandiva. In Salus, I got the data by modifying TensorFlow/PyTorch's allocator and adding customized logging for later post-processing.
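As a rough illustration of that idea (not Salus's actual code, which patches the C++ allocators inside the frameworks), here is a minimal Python sketch of wrapping an allocator's alloc/free calls with logging so the real usage pattern can be reconstructed afterwards. The `LoggingAllocator` name and the host-memory backend are hypothetical stand-ins:

```python
import time

class LoggingAllocator:
    """Hypothetical wrapper: records every alloc/free event with a timestamp,
    so the true memory usage over time can be recovered in post-processing."""

    def __init__(self, alloc_fn, free_fn):
        self.alloc_fn = alloc_fn
        self.free_fn = free_fn
        self.in_use = 0    # bytes currently allocated
        self.events = []   # (timestamp, event, size, bytes_in_use_after)

    def alloc(self, size):
        buf = self.alloc_fn(size)
        self.in_use += size
        self.events.append((time.time(), "alloc", size, self.in_use))
        return buf

    def free(self, buf, size):
        self.free_fn(buf)
        self.in_use -= size
        self.events.append((time.time(), "free", size, self.in_use))

# Demo with a trivial host-memory backend standing in for the GPU allocator.
allocator = LoggingAllocator(alloc_fn=bytearray, free_fn=lambda buf: None)
b1 = allocator.alloc(1024)
b2 = allocator.alloc(2048)
allocator.free(b1, 1024)

peak = max(e[3] for e in allocator.events)
print(peak)  # peak bytes in use seen by the wrapper
```

In the real setting you would dump `events` to a log file and post-process it into a usage-over-time curve, which sidesteps the 100% figure that nvidia-smi reports for the pooled reservation.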