How to find the GPU memory usage pattern in TensorFlow or PyTorch?
Xuyuanjia2014 opened this issue · comments
I have read your paper Fine-Grained GPU Sharing Primitives for Deep Learning Applications (MLSys 2020) and other work on deep learning scheduling, including Gandiva (OSDI 2018) and Tiresias (NSDI 2019).
Because of TensorFlow's and PyTorch's caching allocators, when I use the nvidia-smi command to measure a DL job's GPU memory usage, it always reports 100%.
Are there any tools or methods I can use to get a characterization similar to the ones in Salus or Gandiva, maybe the TensorFlow profiler?
Hi,

nvidia-smi indeed doesn't work in this case because of the memory pooling.
Can't say for Gandiva. In Salus, I got the data by modifying TensorFlow/PyTorch's allocator and adding customized logging for later post-processing.
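As a rough illustration of that idea (not Salus's actual code, which patches the C++ allocators inside the frameworks), here is a minimal Python sketch of wrapping an allocator's alloc/free calls with logging so the real usage pattern can be reconstructed afterwards. The `LoggingAllocator` name and the host-memory backend are hypothetical stand-ins:

```python
import time

class LoggingAllocator:
    """Hypothetical wrapper: records every alloc/free event with a timestamp,
    so the true memory usage over time can be recovered in post-processing."""

    def __init__(self, alloc_fn, free_fn):
        self.alloc_fn = alloc_fn
        self.free_fn = free_fn
        self.in_use = 0    # bytes currently allocated
        self.events = []   # (timestamp, event, size, bytes_in_use_after)

    def alloc(self, size):
        buf = self.alloc_fn(size)
        self.in_use += size
        self.events.append((time.time(), "alloc", size, self.in_use))
        return buf

    def free(self, buf, size):
        self.free_fn(buf)
        self.in_use -= size
        self.events.append((time.time(), "free", size, self.in_use))

# Demo with a trivial host-memory backend standing in for the GPU allocator.
allocator = LoggingAllocator(alloc_fn=bytearray, free_fn=lambda buf: None)
b1 = allocator.alloc(1024)
b2 = allocator.alloc(2048)
allocator.free(b1, 1024)

peak = max(e[3] for e in allocator.events)
print(peak)  # peak bytes in use seen by the wrapper
```

In the real setting you would dump `events` to a log file and post-process it into a usage-over-time curve, which sidesteps the 100% figure that nvidia-smi reports for the pooled reservation.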