Evaluation during training fails - (multi-gpu/distributed) [BoxInst]

Question

ameyparanjape opened this issue 2 years ago

Many thanks to the BoxInst authors for sharing the code for training and evaluating BoxInst.
I am facing an issue while fine-tuning BoxInst models on my custom data.
Specs: multi-GPU training on Linux VMs (4 NVIDIA Tesla T4 GPUs)
I am using the same code provided in this repo, with some dataloader changes, to fine-tune the COCO checkpoint on my custom data. During training I use data previously annotated for instance segmentation for validation/testing, but COCO evaluation fails.
When I run --eval-only mode on the same data on 1 GPU, I do get evaluation results. Is there a way to fix this so that evaluation can run during training? Is the problem caused by distributed evaluation running into a race condition?
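
For context, here is a minimal sketch of the gather-then-evaluate pattern that distributed evaluation commonly follows. It is an illustration only: the ranks are simulated in a single process, and every name in it (`shard_indices`, `run_inference`, `gather_on_main`, `evaluate`) is hypothetical rather than part of the BoxInst codebase. It is not a diagnosis of the failure described above.

```python
from typing import Dict, List


def shard_indices(num_items: int, rank: int, world_size: int) -> List[int]:
    """Indices of the samples that one rank is responsible for."""
    return list(range(rank, num_items, world_size))


def run_inference(rank: int, world_size: int, dataset: List[float]) -> List[Dict]:
    """Hypothetical per-rank inference over this rank's shard only."""
    return [
        {"image_id": i, "score": dataset[i]}
        for i in shard_indices(len(dataset), rank, world_size)
    ]


def gather_on_main(per_rank_predictions: List[List[Dict]]) -> List[Dict]:
    """Stand-in for a collective gather: merge every rank's predictions."""
    merged = [p for preds in per_rank_predictions for p in preds]
    return sorted(merged, key=lambda p: p["image_id"])


def evaluate(rank: int, gathered: List[Dict]) -> Dict[str, float]:
    """Only the main rank computes metrics; other ranks return an empty dict."""
    if rank != 0:
        return {}
    mean_score = sum(p["score"] for p in gathered) / len(gathered)
    return {"num_predictions": len(gathered), "mean_score": mean_score}


# Simulate 4 ranks (one per GPU) over a toy dataset of 8 samples.
dataset = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
world_size = 4
per_rank = [run_inference(r, world_size, dataset) for r in range(world_size)]
gathered = gather_on_main(per_rank)
results = [evaluate(r, gathered) for r in range(world_size)]
```

With real process groups the gather is a collective call, so every rank has to reach it. If evaluation is triggered on only some ranks, or one rank fails before the gather, the remaining ranks can block. Whether that is what happens here is not established in this thread.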

engrjav · Answer 1 · Sat Jul 09 2022 13:12:52 GMT+0800 (China Standard Time)

@ameyparanjape can you share whether you managed to solve this?