imlixinyang / HiSD

Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement" (CVPR 2021 Oral).

Virtual memory usage is too large

PrototypeNx opened this issue · comments

Hello, the model looks very good, but I ran into some problems when trying to train it myself.
My training environment is Windows 10, torch 1.8.0 + CUDA 11, an RTX 3090, and 32 GB of RAM.
With the default config it reports CUDA out of memory, but that message is misleading because there is still plenty of free CUDA memory. While tracking hardware resources during training, I found that the amount of committed memory keeps climbing before the actual training starts and eventually overflows; in other words, a huge amount of virtual memory is requested before training begins. When I lower the parameters so that training runs normally, the actual RAM usage is very small while the virtual memory usage is still large, although it no longer reaches the limit that was hit before. I have never seen such huge virtual memory overhead before, so I wonder whether there is a memory leak during preloading or preprocessing, and whether the program could be optimized.
Thank you!
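
For anyone hitting the same symptom, one way to pin down where the commit grows is to poll the process's virtual-memory size while the preload runs. A minimal diagnostic sketch, assuming psutil is installed (the helper name below is made up and not part of HiSD):

```python
import os
import threading
import time

import psutil  # third-party: pip install psutil

# Poll this process's memory every few seconds so the growth during the
# preload phase can be pinned down. Purely a diagnostic helper with a
# made-up name, not part of the HiSD training code.
def log_virtual_memory(interval_s=5.0):
    proc = psutil.Process(os.getpid())

    def _loop():
        while True:
            info = proc.memory_info()
            # On Windows, vms corresponds to the committed (pagefile) size.
            print(f"RSS {info.rss / 2**30:.2f} GiB | "
                  f"VMS/commit {info.vms / 2**30:.2f} GiB")
            time.sleep(interval_s)

    threading.Thread(target=_loop, daemon=True).start()
```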

I don't know the actual reason in your case, but in my view several things could be contributing (a rough isolation sketch follows the list):

  1. The data_prefetcher in https://github.com/imlixinyang/HiSD/blob/main/core/utils.py, which speeds up the data loader.
  2. cudnn.benchmark in https://github.com/imlixinyang/HiSD/blob/main/core/train.py.
  3. The choice of modules from HiSD differs in each iteration, unlike previous single-path frameworks.
  4. The latent code may not be buffered in the same memory; this could be improved with register_buffer.
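
A rough, generic sketch of the isolation experiments meant above; none of the class or function names below come from the HiSD codebase, they are just placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# (2) Rule out cudnn autotuning: benchmark mode caches algorithm choices
#     per input shape and can allocate extra workspace early in training.
torch.backends.cudnn.benchmark = False

# (4) Keep a fixed latent code as a registered buffer so it is allocated
#     once and moves with the module, instead of being re-created every
#     iteration. LatentHolder is an illustrative module, not from HiSD.
class LatentHolder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.register_buffer("z_fixed", torch.randn(1, latent_dim))

    def forward(self, batch_size):
        return self.z_fixed.expand(batch_size, -1)

# (1) Bypass the custom data_prefetcher and iterate a plain DataLoader
#     (no worker processes, no pinned memory) to check whether the memory
#     growth comes from the data pipeline rather than the model itself.
def plain_loader(dataset, batch_size):
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=True, num_workers=0, pin_memory=False)
```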

32 GB of memory (more than 2x 1080Ti) is enough for the config file (celeba-hq256.yaml), so I am surprised to hear that it raises an OUT OF MEMORY fault. I hope you reproduce the results successfully soon, and you are welcome to share more information or solutions here. I will try my best to help you.

Thank you for such a quick reply; I will check the points you mentioned.
Sorry that I may not have expressed it clearly: the training failure is more likely due to virtual memory rather than CUDA memory.
I use an RTX 3090 with 24 GB of CUDA memory and 32 GB of RAM. The config is the default celebA-HQ.yaml, and the virtual memory usage keeps rising before training starts, reaching 64 GB and overflowing, while CUDA memory and RAM usage stay very low. So I had to set batch_size to 4 to train; the virtual memory then occupies about 40 GB. If I add a few more training attributes, unfortunately the virtual memory overflows again.

I switched to an Ubuntu system with the same configuration, and there were no problems during training. I think the cause of the above problem is the different virtual memory allocation mechanism between Linux and Windows. Thank you again for your help!
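
That matches a common Windows-specific behaviour: DataLoader worker processes are started with "spawn" on Windows (versus "fork" on Linux), so each worker re-imports the training code and commits its own copy of the dataset objects instead of sharing pages copy-on-write. A minimal workaround sketch for anyone who has to stay on Windows (make_loader is an illustrative helper, not part of HiSD):

```python
import platform

from torch.utils.data import DataLoader

# On Windows, DataLoader workers are started with "spawn", so every worker
# commits its own copy of the dataset and imported modules; on Linux,
# "fork" shares most of those pages copy-on-write. Capping the number of
# workers on Windows is a common way to limit the virtual-memory commit.
def make_loader(dataset, batch_size, linux_workers=4):
    workers = 0 if platform.system() == "Windows" else linux_workers
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=workers, pin_memory=True)
```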

Glad to hear that, and you're always welcome here if there are any further problems.