Stonesjtu / pytorch_memlab

Profiling and inspecting memory in pytorch

Question: discrepancy between MemReporter 'Used Memory' and 'nvtop'

indigoviolet opened this issue

I'm trying to understand why there is such a large discrepancy between the 'Used Memory' (or 'allocated memory on cuda:0') reported by MemReporter and the memory usage reported by nvtop (or nvidia-smi). For example, while training a model (RetinaNet from detectron2, for context), I'm seeing ~285M from MemReporter and ~15G from nvtop/nvidia-smi.
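To make the comparison concrete, this is roughly the kind of check I'm doing (a toy sketch; the numbers above come from the actual detectron2 RetinaNet run, and the linear layer below is just a placeholder model):

```python
import torch
from pytorch_memlab import MemReporter

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder for the real RetinaNet

reporter = MemReporter(model)
reporter.report()                            # prints per-tensor sizes and the 'Used Memory' total

# PyTorch allocator counters, for comparison with the reporter's total.
print('allocated:', torch.cuda.memory_allocated() // 2**20, 'MB')
print('reserved :', torch.cuda.memory_reserved() // 2**20, 'MB')

# In my run, nvtop/nvidia-smi report far more than any of these numbers.
```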

Is this all due to the autograd graph? I've been trying to read more about this but haven't found good references.

Thanks for your work on this library, and any pointers you can share about this!

  1. It depends on where the reporter collects the GPU tensors. For example, if you report the memory usage after backward, most intermediate tensors have probably been freed already.
  2. The 15G you get from nvtop / nvidia-smi is the peak memory usage over the whole computation (forward / backward / optimizer step). The CNN algorithm you chose probably requires a lot of memory as its workspace, or the feature maps are very large while not being tracked by any Python object (see the sketch after this list for how to check the peak).
  3. PyTorch 1.6 introduces new memory profiler utilities; I think they may help tackle the autograd-graph problem in this tool.
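A minimal sketch for point 2, comparing the current allocation with the allocator's high-water mark (the model and input are placeholders; nvidia-smi still reads higher because it also counts the CUDA context and cached blocks):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()        # placeholder model
inputs = torch.randn(256, 4096, device='cuda')    # placeholder input batch

torch.cuda.reset_peak_memory_stats()

out = model(inputs)
out.sum().backward()

# Memory held by live tensors right now (roughly what MemReporter can sum up).
print('allocated now :', torch.cuda.memory_allocated() // 2**20, 'MB')
# High-water mark across forward + backward (much closer to nvtop / nvidia-smi).
print('peak allocated:', torch.cuda.max_memory_allocated() // 2**20, 'MB')
```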

Can you post a snippet of how the memory reporter is used in your training scripts?

That's helpful, thanks for the quick response. I tried adding reporting at intermediate stages during training and found that the losses = model(input) step leads to the biggest jump in allocated memory. At the end of that step, MemReporter reports 7GB allocated (the peak is now 15GB), of which only about 150MB is accounted for by the tensors in the report. I haven't been able to figure out (1) how that 7GB breaks down beyond the 150MB, or (2) what is leading to the peak 15GB usage. A rough sketch of where I placed the report() calls is below.
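(The model and data loader here are toy stand-ins for the actual detectron2 trainer objects.)

```python
import torch
from pytorch_memlab import MemReporter

# Toy stand-ins for the detectron2 RetinaNet and its data loader.
model = torch.nn.Linear(1024, 1024).cuda()
data_loader = [torch.randn(64, 1024, device='cuda')]

reporter = MemReporter(model)

for inputs in data_loader:
    reporter.report()               # before the forward pass

    losses = model(inputs)          # the big jump in allocated memory happens here
    reporter.report()               # after the forward pass

    loss = losses.sum()             # the real code sums a dict of losses
    loss.backward()
    reporter.report()               # after backward
```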

I'm guessing a lot of this could be the feature maps; how can I confirm that?

I did try memory profiling with torch.autograd.profiler from PyTorch 1.6, but so far I haven't been able to make much sense of the output (see below). I can see that some functions allocate a lot of memory, but it's unclear how to trace where those functions are being invoked inside a complex model.

------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
empty                                 1.84%            24.998ms         1.84%            24.998ms         23.538us         168 b            168 b            24.20 Gb         24.20 Gb         1062             
resize_                               0.36%            4.899ms          0.36%            4.899ms          12.155us         312 b            312 b            12.33 Gb         12.33 Gb         403              
nonzero                               9.59%            130.246ms        9.59%            130.246ms        3.831ms          0 b              0 b              8.07 Gb          8.07 Gb          34               
sub                                   0.17%            2.371ms          0.44%            5.974ms          49.787us         0 b              0 b              6.75 Gb          0 b              120            
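For reference, this is roughly how I produced that table with the 1.6 profiler (the linear layer and random batch are stand-ins for the real model and data):

```python
import torch
from torch.autograd import profiler

# Stand-ins for the real detectron2 model and input batch.
model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device='cuda')

with profiler.profile(use_cuda=True, profile_memory=True, record_shapes=True) as prof:
    losses = model(inputs)

# Sorting by self CUDA memory surfaces ops like `empty` and `resize_` above.
print(prof.key_averages().table(sort_by='self_cuda_memory_usage', row_limit=10))
```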

If you report the memory usage after losses = model(input), most of the intermediate tensors exist only as C-level storages saved by autograd rather than as Python variables, and those are not trackable from pure Python code. I believe this is the reason for such a large gap between the reported results and the actually allocated memory.

Can you please try the memory_profiler (https://github.com/Stonesjtu/pytorch_memlab#memory-profiler) to profile your model's forward function line by line?
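Something like this toy sketch with the @profile decorator from the README (in your case you would wrap, or decorate, the model's actual forward):

```python
import torch
from pytorch_memlab import profile

# Toy layers standing in for the real model.
linear1 = torch.nn.Linear(1024, 1024).cuda()
linear2 = torch.nn.Linear(1024, 10).cuda()

@profile                            # collects line-by-line CUDA memory usage for this function
def run_forward(x):
    h = linear1(x)                  # each line's memory delta shows up as its own row
    h = h.relu()
    return linear2(h)

out = run_forward(torch.randn(64, 1024, device='cuda'))
```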

I was able to get a better picture of why my peak memory usage was high after using the memory profiler. Thanks for your advice!