aosokin / os2d

OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features

Memory Leak when training and long evaluation time

chasb799 opened this issue

Hey guys,

I tried to train the v2 model for 200k iterations with batch size 4 and without hard patch mining. For training I use a server cluster with an A100 GPU (40 GB memory), 1080 GB of RAM, and an AMD EPYC 7662 64-core CPU. I noticed that both GPU memory and RAM usage grow enormously, so the training process gets killed after around 50k iterations once the RAM is full. Has anybody had the same problem, or does anybody know why a memory leak is occurring? Furthermore, the evaluation of the trained models takes extremely long (~45 minutes for the 84 images in grozi-val-new-cl); is this normal behavior?
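In case it helps, this is the kind of per-iteration memory logging I would use to narrow the leak down. It is only a generic sketch assuming a standard PyTorch training loop; the `log_memory` helper, the `psutil` dependency, and the 1000-iteration interval are illustrative and not part of the OS2D code.

```python
import gc
import psutil   # not an OS2D dependency, used here only to read host RAM usage
import torch

def log_memory(step, device=0):
    """Print host RAM and GPU memory usage; a slow leak shows up as a steady climb."""
    rss_gb = psutil.Process().memory_info().rss / 1024**3
    alloc_gb = torch.cuda.memory_allocated(device) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024**3
    # counting live tensors helps tell allocator caching apart from tensors
    # kept alive by Python references (a common source of leaks)
    n_tensors = sum(1 for obj in gc.get_objects() if torch.is_tensor(obj))
    print(f"iter {step}: RAM {rss_gb:.2f} GB | "
          f"GPU allocated {alloc_gb:.2f} GB, reserved {reserved_gb:.2f} GB | "
          f"live tensors {n_tensors}")

# hypothetical usage inside the training loop:
# if i_iter % 1000 == 0:
#     log_memory(i_iter)
```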

Best regards, Simon :)

Hi, this is definitely not the expected behavior. I do not remember anything like this.

OK, maybe I should try to use the exact versions of the dependencies. Do you remember how long training took for 200k iterations with the v2 model? And what about the time for one evaluation?

I've found this timing mentioned in the paper:
On GTX 1080Ti in our evaluation regime (with image pyramid) of the val-new-cl subset, our PyTorch [30] code computed input features in 0.46s per image and the heads took 0.052s and 0.064s per image per class for V1 and V2, respectively, out of which the transformation net itself took 0.020s. At training, we chose the number of classes such that the time on the heads matched the time of feature extraction. Training on a batch of 4 patches of size 600x600 and 15 classes took 0.7s.
This is the only reference point I currently have.
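As a very rough sanity check, these per-image and per-class numbers can be combined into an expected evaluation time. The sketch below is only back-of-the-envelope arithmetic: the class count is a placeholder rather than the actual number of classes in grozi-val-new-cl, and the timings are for a GTX 1080Ti, not an A100.

```python
# Back-of-the-envelope estimate from the paper's GTX 1080Ti numbers (V2 head).
feature_time_per_image = 0.46  # s per image, input feature extraction (pyramid evaluation regime)
head_time_per_class = 0.064    # s per image per class, V2 head

num_images = 84      # grozi-val-new-cl
num_classes = 100    # placeholder, substitute the real class count

total_s = num_images * (feature_time_per_image + head_time_per_class * num_classes)
print(f"estimated evaluation time: ~{total_s / 60:.1f} min")  # ~9.6 min for 100 classes
```

Comparing such an estimate, with the real class count and your hardware, against the measured ~45 minutes should show whether the evaluation time is in the expected range.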