utsaslab / MONeT

MONeT framework for reducing memory consumption of DNN training

Home Page: https://arxiv.org/abs/2010.14501

MONeT does not save the memory used by PyTorch

merrymercy opened this issue · comments

Hi, thanks for the awesome library.

How do you measure the used memory? Is it empirically measured or theoretically computed?
I measured the memory usage with nvidia-smi and found that MONeT does not reduce the memory used by PyTorch.

First, I ran the 10GB solution with python3 imagenet.py ~/imagenet -a resnet50 --gpu 0 --epochs 1 --batch-size 184 --sol ../data/monet_r50_184_24hr/solution_resnet50_184_inplace_conv_multiway_newnode_10.00.pkl. The peak memory reported by nvidia-smi was around 12GB.
Then, I ran the 6GB solution with python3 imagenet.py ~/imagenet -a resnet50 --gpu 0 --epochs 1 --batch-size 184 --sol ../data/monet_r50_184_24hr/solution_resnet50_184_inplace_conv_multiway_newnode_6.00.pkl. The peak memory reported by nvidia-smi was still around 12GB.

How can I use MONeT to actually reduce the memory used by PyTorch?

Thanks for the comments.

We measured the used memory with torch.cuda.memory_allocated, which gives the total memory occupied by live PyTorch tensors. nvidia-smi, on the other hand, shows the total GPU memory used by the process. The difference arises because PyTorch's caching memory allocator does not release GPU memory back to the system after tensors are deallocated. While this design works well for DL workloads in general, it is not ideal for checkpointing, where the memory of deallocated tensors usually needs to be explicitly returned to the system.

We noticed that allocating a tensor pool the size of the expected memory usage goes a long way in bringing the actual system memory used close to the total memory used by PyTorch tensors. This can be done from the Python code by adding the following lines before the training loop:

pool = torch.zeros(expected_memory // 4).cuda()  # expected_memory in bytes; float32 = 4 bytes/element
del pool
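For reference, the size arithmetic behind that snippet can be sketched as a small pure-Python helper (`pool_elements` is a hypothetical name, assuming a float32 pool and a memory budget given in GiB):

```python
def pool_elements(budget_gib: float, bytes_per_element: int = 4) -> int:
    """Number of float32 elements needed to pre-allocate a pool tensor
    of roughly `budget_gib` GiB (1 GiB = 2**30 bytes)."""
    return int(budget_gib * (1 << 30)) // bytes_per_element

# e.g. a 10 GiB budget corresponds to 2_684_354_560 float32 elements,
# so the pool would be created as torch.zeros(pool_elements(10.0)).cuda()
```

Deleting the pool immediately afterwards leaves that memory in PyTorch's cache, so later tensor allocations are served from it instead of triggering fresh allocations from the driver.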

I've added an explanation of this to the README. Closing this issue now.