Evaluation during training fails - (multi-gpu/distributed) [BoxInst]
ameyparanjape opened this issue
Many thanks to the BoxInst authors for sharing the codebase for training and evaluation.
I am facing an issue while finetuning BoxInst models on my custom data.
Specs: Multi-GPU training on Linux VMs (4 NVIDIA Tesla T4 GPUs)
I am using the same code provided in this repo, with some dataloader modifications, to fine-tune the COCO checkpoint on my custom data. During training I use data that was previously annotated for instance segmentation as the validation/test set, but COCO evaluation fails.
When I run in `--eval-only` mode on the same data on 1 GPU, I do get the evaluation results. Is there a fix that would allow evaluation during training? Could this problem be caused by distributed evaluation running into a race condition?
@ameyparanjape can you share if you succeeded at solving this?