utsaslab / MONeT

MONeT framework for reducing memory consumption of DNN training

Home Page: https://arxiv.org/abs/2010.14501

MONeT does not save the memory used by PyTorch

merrymercy opened this issue · comments

Hi, thanks for the awesome library.

How do you measure the used memory? Is it empirically measured or theoretically computed?
I measured the memory usage with nvidia-smi and found that MONeT does not reduce the memory used by PyTorch.

First, I ran the 10GB solution with python3 imagenet.py ~/imagenet -a resnet50 --gpu 0 --epochs 1 --batch-size 184 --sol ../data/monet_r50_184_24hr/solution_resnet50_184_inplace_conv_multiway_newnode_10.00.pkl. The peak memory reported by nvidia-smi was around 12GB.
Then, I ran the 6GB solution with python3 imagenet.py ~/imagenet -a resnet50 --gpu 0 --epochs 1 --batch-size 184 --sol ../data/monet_r50_184_24hr/solution_resnet50_184_inplace_conv_multiway_newnode_6.00.pkl. The peak memory reported by nvidia-smi was still around 12GB.

How can I use MONeT to actually reduce the memory used by PyTorch?

Thanks for the comments.

We measured the used memory with torch.cuda.memory_allocated, which gives the total memory occupied by live PyTorch tensors. nvidia-smi, on the other hand, shows the total GPU memory used by the process. The difference arises because PyTorch's caching memory allocator does not release GPU memory back to the system after tensors are deallocated. While this design works well for DL workloads in general, it is not ideal for checkpointing, where the memory of deallocated tensors usually needs to be explicitly returned to the system.

We noticed that allocating a tensor pool the size of the expected memory usage goes a long way in bringing the actual system memory used close to the total memory used by PyTorch tensors. This can be done from the Python code by adding the following lines before the training loop:

pool = torch.zeros(expected_memory // 4).cuda()  # expected_memory in bytes; float32 = 4 bytes/element
del pool
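For reference, the size arithmetic behind that snippet can be sketched as a small pure-Python helper (`pool_elements` is a hypothetical name, assuming a float32 pool and a memory budget given in GiB):

```python
def pool_elements(budget_gib: float, bytes_per_element: int = 4) -> int:
    """Number of float32 elements needed to pre-allocate a pool tensor
    of roughly `budget_gib` GiB (1 GiB = 2**30 bytes)."""
    return int(budget_gib * (1 << 30)) // bytes_per_element

# e.g. a 10 GiB budget corresponds to 2_684_354_560 float32 elements,
# so the pool would be created as torch.zeros(pool_elements(10.0)).cuda()
```

Deleting the pool immediately afterwards leaves that memory in PyTorch's cache, so later tensor allocations are served from it instead of triggering fresh allocations from the driver.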

I've added an explanation of this to the README. Closing this issue now.