PyTorch code training may have memory leak

Question

DrRyanHuang opened this issue 9 months ago

Ryan commented 9 months ago

I encountered a memory overflow on another server, which froze the system. The same leak may also be the cause of the following issues:

#185
#206

lyuwenyu · Answer 1 · Wed Feb 21 2024 10:57:34 GMT+0800 (China Standard Time)

(Adding related issues: #93, #172.)

Can you do more testing locally and try to solve this problem?
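One possible way to run such a local test, as a minimal sketch rather than anything taken from this repository: compare `tracemalloc` snapshots between epochs and print the source lines whose allocations keep growing. `fake_epoch` and `leaky_store` are hypothetical stand-ins for `train_one_epoch` and whatever long-lived object holds on to the data. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory held by native code (for example tensor storage) would not show up here.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated size grew most between two snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


def fake_epoch(store):
    # Hypothetical stand-in for train_one_epoch: it "leaks" by appending
    # annotation-like dicts to a list that outlives the epoch.
    store.append([{"bbox": [0.0, 0.0, 1.0, 1.0]} for _ in range(10_000)])


tracemalloc.start()
leaky_store = []          # hypothetical long-lived container
growth_per_epoch = []     # bytes gained per epoch (top entries only)
previous = tracemalloc.take_snapshot()

for epoch in range(3):
    fake_epoch(leaky_store)
    current = tracemalloc.take_snapshot()
    growth = top_growth(previous, current)
    growth_per_epoch.append(sum(s.size_diff for s in growth))
    for stat in growth:
        print(f"epoch {epoch}: {stat}")
    previous = current

tracemalloc.stop()
```

If the same lines show positive growth after every epoch, those lines are a reasonable place to start looking for objects that are never released.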

Ryan · Answer 2 · Wed Feb 21 2024 11:10:39 GMT+0800 (China Standard Time)

Two days ago, I used gc to analyze memory leaks.
It seemed that the dataset was not released after one epoch of training/eval, but I am far from sure because I did not have enough time to investigate.

I hope this helps you solve the problem. I added the following code after train_one_epoch.

    # if cuda_empty_cache:
    #     del metric_logger
    #     gc.collect()
    #     # torch.cuda.empty_cache()
    
    # print(f"Number of objects in gc.garbage: {len(gc.garbage)}")

    # ann = []
    # for cycle in cycles:
    #     if isinstance(cycle, dict) and 'bbox' in cycle:
    #         ann.append(cycle)

    # for obj in ann: 
    #     referrers = gc.get_referrers(obj)
    #     print(f"Referrers of {obj}: {referrers}")
    #     break

ShafaMW · Answer 3 · Fri Jun 14 2024 12:35:05 GMT+0800 (China Standard Time)

Two days ago, I used gc to analyze memory leaks. It seemed that the dataset was not released after one epoch of training/eval, but I am far from sure because I did not have enough time to investigate.

I hope this helps you solve the problem. I added the following code after train_one_epoch.

    # if cuda_empty_cache:
    #     del metric_logger
    #     gc.collect()
    #     # torch.cuda.empty_cache()
    
    # print(f"Number of objects in gc.garbage: {len(gc.garbage)}")

    # ann = []
    # for cycle in cycles:
    #     if isinstance(cycle, dict) and 'bbox' in cycle:
    #         ann.append(cycle)

    # for obj in ann: 
    #     referrers = gc.get_referrers(obj)
    #     print(f"Referrers of {obj}: {referrers}")
    #     break

Hello, would you mind providing the full file? I'm confused about how to use your solution. For example, I don't understand what the cycles variable contains.
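The full file was never posted in this thread, so the definition of `cycles` is unknown. As a minimal runnable sketch of the same idea, the snippet below assumes `cycles` was meant to be the list of live objects returned by `gc.get_objects()`; that is only a guess, not the original author's code. `dataset_cache` is a hypothetical stand-in for the dataset's annotation storage. Also note that `gc.garbage` normally stays empty unless objects are uncollectable or `gc.DEBUG_SAVEALL` is set, so its length alone may not reveal a leak.

```python
import gc


def find_annotation_dicts(limit=None):
    """Collect live dicts that look like annotations (they have a 'bbox' key)."""
    gc.collect()  # drop unreachable objects first so only live ones remain
    found = []
    for obj in gc.get_objects():  # assumed meaning of `cycles` in the snippet above
        if type(obj) is dict and "bbox" in obj:
            found.append(obj)
            if limit is not None and len(found) >= limit:
                break
    return found


def describe_referrers(obj):
    """Return the type names of the objects that hold a reference to obj."""
    return [type(r).__name__ for r in gc.get_referrers(obj)]


# Hypothetical stand-in for annotations kept alive by a dataset object.
dataset_cache = [{"bbox": [10, 20, 30, 40], "category_id": 1}]

anns = find_annotation_dicts()
print(f"Live annotation-like dicts: {len(anns)}")
if anns:
    print(f"Referrers of first match: {describe_referrers(anns[0])}")
```

Calling this after each epoch and watching whether the count keeps rising would indicate whether annotation dicts are accumulating; the referrer types then hint at which container is keeping them alive.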