lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥


GPU and CPU memory keep increasing during training (PyTorch)

aylive opened this issue

Impressive and very helpful work. I'm just a little confused: when trying to reproduce the COCO training with the PyTorch implementation (default configs), I noticed that both CPU and GPU memory keep increasing as training iterations go on. I tried this on two servers:

  1. intel core i9 + rtx4090*1
  2. intel xeon + rtx3080*1
    Both servers have a single GPU. (:< Sorry for not giving more details about the servers; I'll add more if needed.)

So far, the training process has not been killed for running out of memory. But as the CPU memory fills up, training slows down a lot.

I'm really struggling with this. Many thanks for your help.
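Before looking for a cause, it helps to confirm that memory really grows per iteration instead of just fluctuating. A minimal sketch, assuming it is enough to sample memory after each step and compare the start and end of the run: it uses Python's standard `tracemalloc`, which only sees Python-heap allocations (not CUDA memory or native buffers), so for the GPU side of a real PyTorch run you would sample `torch.cuda.memory_allocated()` the same way. `measure_growth`, `looks_like_leak`, the 1 MiB threshold and the two toy steps are hypothetical names chosen for illustration, not RT-DETR code.

```python
import tracemalloc
from typing import Callable, List


def measure_growth(step: Callable[[], object], iterations: int = 50, warmup: int = 5) -> List[int]:
    """Call `step` repeatedly and record the traced Python heap size (bytes) after each call.

    tracemalloc only sees Python-level allocations; CUDA memory is invisible to it,
    so on a GPU you would sample torch.cuda.memory_allocated() the same way.
    """
    tracemalloc.start()
    try:
        for _ in range(warmup):  # let caches and lazy initialisation settle first
            step()
        samples = []
        for _ in range(iterations):
            step()
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    finally:
        tracemalloc.stop()  # also discards the collected traces
    return samples


def looks_like_leak(samples: List[int], min_growth_bytes: int = 1 << 20) -> bool:
    """Heuristic: flag a leak if memory grew by more than `min_growth_bytes` over the run
    and the second half of the samples is higher on average than the first half."""
    if len(samples) < 2:
        return False
    half = len(samples) // 2
    first_avg = sum(samples[:half]) / half
    second_avg = sum(samples[half:]) / (len(samples) - half)
    return samples[-1] - samples[0] > min_growth_bytes and second_avg > first_avg


# Two toy "iterations" for comparison (hypothetical, not RT-DETR code).
_history = []


def leaky_step() -> None:
    # Keeps every iteration's output alive, so memory grows without bound.
    _history.append(bytes(100_000))


def bounded_step() -> int:
    # The temporary buffer is released when the function returns.
    buf = bytes(100_000)
    return len(buf)


if __name__ == "__main__":
    print("leaky:", looks_like_leak(measure_growth(leaky_step)))
    print("bounded:", looks_like_leak(measure_growth(bounded_step)))
```

Run as a script, it flags `leaky_step` and not `bounded_step`. Holding on to every iteration's outputs is only one generic way to produce this pattern; nothing in this thread confirms it as the cause here.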

#93

I don't know where the problem is either. But I will release a new version of the codebase in the future; you can star the repo and keep following the updates.


Do you run evaluation after each training epoch? I tried turning off evaluation, and training is much faster. I'm also curious about your PyTorch training on COCO; please share more info, thanks a lot!

Yes, I do run evaluation after each epoch.


Thanks. I ran into the same memory issue as reported in #93 and here. After I turned off evaluation after each training epoch, the performance seems to be OK.


Very useful information, perhaps you are right


Thanks, and thank you for the great work! I will run more tests locally to track down this issue if I have time.
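The workaround reported in this thread is to turn off evaluation after each epoch. The thread does not name the RT-DETR config option for this, so the sketch below uses hypothetical callbacks (`train_one_epoch`, `evaluate`, `save_checkpoint`) and a hypothetical `eval_interval` parameter to show only the control flow: evaluate rarely or never inside the training loop, but still save a checkpoint every epoch so nothing is lost.

```python
from typing import Callable, List


def run_training(
    num_epochs: int,
    train_one_epoch: Callable[[int], None],
    evaluate: Callable[[int], float],
    save_checkpoint: Callable[[int], None],
    eval_interval: int = 0,
) -> List[int]:
    """Train for `num_epochs` epochs, saving a checkpoint after every epoch.

    eval_interval == 0 turns in-loop evaluation off entirely; otherwise evaluation
    runs every `eval_interval` epochs and after the final epoch.
    Returns the (0-based) epochs that were evaluated.
    """
    evaluated: List[int] = []
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
        # Save every epoch so skipped evaluations can still be run offline later.
        save_checkpoint(epoch)
        is_last = epoch == num_epochs - 1
        if eval_interval > 0 and ((epoch + 1) % eval_interval == 0 or is_last):
            evaluate(epoch)
            evaluated.append(epoch)
    return evaluated


if __name__ == "__main__":
    saved = []
    evaluated = run_training(
        num_epochs=10,
        train_one_epoch=lambda epoch: None,  # stand-in for the real training step
        evaluate=lambda epoch: 0.0,          # stand-in for COCO evaluation
        save_checkpoint=saved.append,
        eval_interval=3,
    )
    print("saved:", saved)
    print("evaluated:", evaluated)
```

With `num_epochs=10` and `eval_interval=3`, evaluation runs after epochs 2, 5, 8 and 9 (0-based) while all ten checkpoints are saved; `eval_interval=0` skips in-loop evaluation completely.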


Thanks for the info, I'll try this. But I need to fine-tune on my own dataset, and without evaluation I can't tell when to stop before the model overfits. How do you handle this?


I just evaluate each epoch's checkpoint manually. For fine-tuning, 3-5 epochs may be enough.
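The reply above amounts to keeping every epoch's weights and evaluating them afterwards. Once that offline evaluation yields one score per epoch, a patience-based early-stopping rule is one simple way to decide which checkpoint to keep and when further fine-tuning stopped helping. `select_checkpoint`, the `patience` value and the AP numbers are illustrative assumptions, not part of RT-DETR and not results from this thread.

```python
from typing import Optional, Sequence, Tuple


def select_checkpoint(scores: Sequence[float], patience: int = 2) -> Tuple[int, Optional[int]]:
    """Early-stopping rule over per-epoch validation scores (higher is better, e.g. AP).

    Scans epochs in order and stops once `patience` consecutive epochs fail to beat
    the best score so far. Returns (best_epoch, stop_epoch), both 0-based;
    stop_epoch is None if the rule never triggered.
    """
    if not scores:
        raise ValueError("need at least one score")
    if patience < 1:
        raise ValueError("patience must be >= 1")
    best, since_best = 0, 0
    for epoch in range(1, len(scores)):
        if scores[epoch] > scores[best]:
            best, since_best = epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                return best, epoch  # later epochs are ignored, as in early stopping
    return best, None


if __name__ == "__main__":
    # Hypothetical AP values from evaluating each saved epoch by hand.
    ap_per_epoch = [0.30, 0.35, 0.37, 0.36, 0.35, 0.38]
    print(select_checkpoint(ap_per_epoch))
```

Here the rule keeps epoch 2 and stops at epoch 4, even though epoch 5 scores higher; a larger `patience` costs more evaluation but lowers the chance of stopping too early.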