abhi1kumar / DEVIANT

[ECCV 2022] Official PyTorch Code of DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection

Home Page: https://arxiv.org/abs/2207.10758

Failure during Multi-GPU evaluation

danielvais opened this issue

Hi,
I'm encountering an error in the first eval epoch. The error I get is:
[Image: Screenshot from 2024-01-17 16-32-04]
I am running the gupnet model training:

CUDA_VISIBLE_DEVICES=0,1 python -u tools/train_val.py --config=experiments/run_221.yaml 

I successfully trained the model on a subset of only 300 images. The error appears when I train on the full dataset.
Any suggestions?

Hi @danielvais
Thank you for your interest in DEVIANT again. Here are a couple of things I would try:

"The error appears when I train on the full dataset."

  • Try evaluation in single GPU setting:
CUDA_VISIBLE_DEVICES=0 python -u tools/train_val.py --config=experiments/run_221.yaml --resume_model=... -e
  • Also check whether the val batch size is too large for the available GPU memory. I see that you use a batch size of 6. You could try changing this line to set the batch size to 2.

Hi @abhi1kumar
As you advised in my previous issue, I can't train the full dataset on a single GPU due to lack of memory.

Reducing the batch size didn't help, but I saw that the validation failed on the last batch, which was of size 1 instead of 2. When I removed the last image from the validation dataset, so that it contains an even number of images, the training was successful. I didn't dig into why a validation dataset with an odd number of images raises an error, but this solution is good enough for me for now. Thanks :)

Glad that you were able to find a good enough solution.
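
For anyone hitting the same failure, the workaround described above (making sure the last validation batch is not smaller than the others) can be sketched in plain Python. This is only an illustration: the thread does not establish why a final batch of size 1 fails, and the helper names `batch_sizes` and `trim_to_full_batches`, as well as the sample ids, are hypothetical and not part of the DEVIANT code.

```python
def batch_sizes(num_samples, batch_size):
    """Return the size of each batch a sequential loader would produce."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder:
        # The leftover samples form a final, smaller batch.
        sizes.append(remainder)
    return sizes


def trim_to_full_batches(samples, batch_size):
    """Drop trailing samples so that every batch is full."""
    usable = len(samples) - len(samples) % batch_size
    return samples[:usable]


# Hypothetical validation split with an odd number of image ids.
val_ids = [f"{i:06d}" for i in range(7)]

print(batch_sizes(len(val_ids), 2))   # [2, 2, 2, 1]: last batch has one sample
trimmed = trim_to_full_batches(val_ids, 2)
print(batch_sizes(len(trimmed), 2))   # [2, 2, 2]: every batch is full
```

PyTorch's `torch.utils.data.DataLoader` also accepts `drop_last=True`, which discards an incomplete final batch at load time without editing the split file. Either way, the dropped images are not evaluated, so the reported validation metrics cover a slightly smaller set.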