aosokin / os2d

OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features

Memory Leak when training and long evaluation time

chasb799 opened this issue

Hey guys,

I tried to train the v2 model for 200k iterations with batch size 4 and without hard patch mining. For training I use a server cluster with an A100 GPU (40 GB memory), 1080 GB of RAM, and an AMD EPYC 7662 64-core CPU. I noticed that both GPU memory and RAM usage grow enormously, so the training process gets killed after around 50k iterations once the RAM is full. Has anybody had the same problem, or does anybody know why a memory leak is occurring? Furthermore, the evaluation of the trained models takes extremely long (~45 minutes for the 84 images in grozi-val-new-cl); is this normal behavior?
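In case it helps, this is the kind of per-iteration memory logging I would use to narrow the leak down. It is only a generic sketch assuming a standard PyTorch training loop; the `log_memory` helper, the `psutil` dependency, and the 1000-iteration interval are illustrative and not part of the OS2D code.

```python
import gc
import psutil   # not an OS2D dependency, used here only to read host RAM usage
import torch

def log_memory(step, device=0):
    """Print host RAM and GPU memory usage; a slow leak shows up as a steady climb."""
    rss_gb = psutil.Process().memory_info().rss / 1024**3
    alloc_gb = torch.cuda.memory_allocated(device) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024**3
    # counting live tensors helps tell allocator caching apart from tensors
    # kept alive by Python references (a common source of leaks)
    n_tensors = sum(1 for obj in gc.get_objects() if torch.is_tensor(obj))
    print(f"iter {step}: RAM {rss_gb:.2f} GB | "
          f"GPU allocated {alloc_gb:.2f} GB, reserved {reserved_gb:.2f} GB | "
          f"live tensors {n_tensors}")

# hypothetical usage inside the training loop:
# if i_iter % 1000 == 0:
#     log_memory(i_iter)
```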

Best regards, Simon :)

Hi, this is definitely not the expected behavior. I do not remember anything like this.

OK, maybe I should try to use the exact versions of the dependencies. Do you remember how long training took for 200k iterations with the v2 model? And what about the time for one evaluation?

I've found this timing mentioned in the paper:
On GTX 1080Ti in our evaluation regime (with image pyramid) of the val-new-cl subset, our PyTorch [30] code computed input features in 0.46s per image and the heads took 0.052s and 0.064s per image per class for V1 and V2, respectively, out of which the transformation net itself took 0.020s. At training, we chose the number of classes such that the time on the heads matched the time of feature extraction. Training on a batch of 4 patches of size 600x600 and 15 classes took 0.7s.
This is the only reference point I currently have.
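As a very rough sanity check, these per-image and per-class numbers can be combined into an expected evaluation time. The sketch below is only back-of-the-envelope arithmetic: the class count is a placeholder rather than the actual number of classes in grozi-val-new-cl, and the timings are for a GTX 1080Ti, not an A100.

```python
# Back-of-the-envelope estimate from the paper's GTX 1080Ti numbers (V2 head).
feature_time_per_image = 0.46  # s per image, input feature extraction (pyramid evaluation regime)
head_time_per_class = 0.064    # s per image per class, V2 head

num_images = 84      # grozi-val-new-cl
num_classes = 100    # placeholder, substitute the real class count

total_s = num_images * (feature_time_per_image + head_time_per_class * num_classes)
print(f"estimated evaluation time: ~{total_s / 60:.1f} min")  # ~9.6 min for 100 classes
```

Comparing such an estimate, with the real class count and your hardware, against the measured ~45 minutes should show whether the evaluation time is in the expected range.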