GPU and CPU memory keep increasing during training (PyTorch)

Question

aylive opened this issue 6 months ago

Impressive and very helpful work. I'm a little confused, though: when trying to reproduce the training on COCO with the PyTorch implementation (default configs), I noticed that both CPU and GPU memory keep increasing as the iterations go on. I tried this on two servers:

Intel Core i9 + RTX 4090 x1
Intel Xeon + RTX 3080 x1
Both have a single GPU. (Sorry for the lack of detail about the servers; I'll add more if needed.)

So far the training process has not been killed for running out of memory, but once CPU memory is fully used, training slows down a lot.

I'm really struggling with this. Many thanks for your help.
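
To confirm that memory really grows with the iteration count, it can help to log both numbers at a fixed interval. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, the CPU figure is read from `/proc/self/statm` (Linux only), and the GPU figure uses `torch.cuda.memory_allocated()`, which only counts memory held by PyTorch tensors.

```python
import os

try:
    import torch  # optional; GPU numbers are skipped if PyTorch is missing
except ImportError:
    torch = None


def cpu_rss_mib():
    """Resident memory of this process in MiB (Linux only); None elsewhere."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


def gpu_allocated_mib():
    """Memory currently held by PyTorch tensors on the GPU in MiB, or None."""
    if torch is None or not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / 2**20


def memory_report(step):
    """Build one log line; call this every N iterations of the training loop."""
    def show(value):
        return "n/a" if value is None else f"{value:.0f} MiB"

    return (
        f"step {step}: cpu_rss={show(cpu_rss_mib())} "
        f"gpu_allocated={show(gpu_allocated_mib())}"
    )
```

Printing `memory_report(step)` every few hundred iterations, and once before and after each evaluation pass, should show whether the growth is steady or comes in jumps.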

lyuwenyu · Answer 1 · Fri Dec 29 2023 10:48:33 GMT+0800 (China Standard Time)

#93

I don't know where the problem is either. But I will release a new version of the codebase in the future; you can star the repo and follow the updates.

Tommy · Answer 2 · Tue Jan 02 2024 11:33:00 GMT+0800 (China Standard Time)

Do you run evaluation after each training epoch? I tried turning evaluation off, and training is much faster. I'm curious about the PyTorch implementation for COCO; please share more information. Thanks a lot!

lyuwenyu · Answer 3 · Tue Jan 02 2024 11:51:19 GMT+0800 (China Standard Time)

Yes, I do run evaluation after each epoch.

Tommy · Answer 4 · Tue Jan 02 2024 11:53:09 GMT+0800 (China Standard Time)

Thanks. I hit the same memory issue as reported in #93 and here. After I turned off the evaluation after each training epoch, the performance seems to be OK.
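
A way to make evaluation optional or less frequent, rather than removing it entirely, is to gate it behind an interval. This is only a sketch of the idea under stated assumptions: `train`, `train_one_epoch`, `evaluate`, and `eval_interval` are hypothetical names, not the repository's actual training script or config keys, and the thread does not establish why evaluation causes the growth.

```python
import gc


def train(num_epochs, train_one_epoch, evaluate=None, eval_interval=1):
    """Train for num_epochs; evaluate every eval_interval epochs (0 disables it).

    train_one_epoch and evaluate are caller-supplied callables taking the
    epoch index. Returns a dict mapping epoch index to the evaluation result.
    """
    results = {}
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
        run_eval = (
            evaluate is not None
            and eval_interval > 0
            and (epoch + 1) % eval_interval == 0
        )
        if run_eval:
            results[epoch] = evaluate(epoch)
            # Collect unreachable objects left over from the evaluation pass.
            gc.collect()
    return results
```

With `eval_interval=0` the loop matches the "evaluation turned off" setup described above; a larger interval keeps some validation signal while running the evaluation less often.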

lyuwenyu · Answer 5 · Tue Jan 02 2024 11:59:18 GMT+0800 (China Standard Time)

Very useful information; perhaps you are right.

Tommy · Answer 6 · Tue Jan 02 2024 12:00:46 GMT+0800 (China Standard Time)

Thanks, and thank you for the great work! I will do more testing locally to track down this issue when I have some time.

sxli · Answer 7 · Mon Jan 08 2024 08:36:17 GMT+0800 (China Standard Time)

Thanks for the info, I'll try this. However, I have to fine-tune on my own dataset, and without evaluation I can't tell when to stop before overfitting. How do you solve this problem?

Tommy · Answer 8 · Mon Jan 08 2024 08:40:37 GMT+0800 (China Standard Time)

Just evaluate each epoch's model manually. For fine-tuning, 3-5 epochs may be enough.
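
Manual per-epoch evaluation can be scripted by running the evaluation once per saved checkpoint, each in its own process, so that whatever memory the evaluation holds is released when that process exits. The sketch below rests on assumptions: the `checkpointNNNN.pth` file-name pattern and the `eval_command` argument are hypothetical placeholders, and the real evaluation command for this repository has to be filled in by the user.

```python
import re
import subprocess
from pathlib import Path


def checkpoints_by_epoch(directory, pattern=r"checkpoint(\d+)\.pth$"):
    """Return (epoch, path) pairs sorted by epoch.

    The file-name pattern is an assumption; adjust it to the actual names.
    """
    found = []
    for path in Path(directory).iterdir():
        match = re.search(pattern, path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)


def evaluate_each(directory, eval_command):
    """Run eval_command with the checkpoint path appended, once per checkpoint.

    Each run is a separate process. Returns {epoch: process return code}.
    """
    return_codes = {}
    for epoch, path in checkpoints_by_epoch(directory):
        completed = subprocess.run(list(eval_command) + [str(path)])
        return_codes[epoch] = completed.returncode
    return return_codes
```

Here `eval_command` would be the list of arguments that starts an evaluation-only run, with the checkpoint path as its final argument. Comparing the resulting metrics across epochs gives the early-stopping signal without evaluating inside the training process.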