Failure during Multi-GPU evaluation

Question

danielvais opened this issue 7 months ago

Hi,
I'm encountering an error in the first eval epoch. The error I get is:

I am running the gupnet model training:

CUDA_VISIBLE_DEVICES=0,1 python -u tools/train_val.py --config=experiments/run_221.yaml

I successfully trained the model on a subset of only 300 images. The error appears when I train on the full dataset.
Any suggestions?

Abhinav Kumar | अभिनव कुमार · Answer 1 · Thu Jan 18 2024 02:13:18 GMT+0800 (China Standard Time)

Hi @danielvais
Thank you for your interest in DEVIANT again. Here are a couple of things I would try:

"The error appears when I train on the full dataset."

Try evaluating in a single-GPU setting:

CUDA_VISIBLE_DEVICES=0 python -u tools/train_val.py --config=experiments/run_221.yaml --resume_model=... -e

Also check whether the validation batch size is too large for the available GPU memory. I see that you use a batch size of 6. You could try changing this line to set the batch size to 2.

danielvais · Answer 2 · Sun Jan 21 2024 20:55:20 GMT+0800 (China Standard Time)

Hi @abhi1kumar
As you advised in my previous issue, I can't train the full dataset on a single GPU due to lack of memory.

Reducing the batch size didn't help, but I saw that validation fails on the last batch, which was of size 1 instead of 2. When I removed the last image from the validation dataset, so that it contains an even number of images, the training was successful. I didn't dig into why a validation dataset with an odd number of images raises an error, but this solution is good enough for me for now. Thanks :)
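The workaround above can be checked ahead of time with a few lines of arithmetic. The following is a minimal sketch, not code from the DEVIANT repository: the helper names `last_batch_size` and `trim_to_full_batches` and the sample IDs are hypothetical. It assumes the validation loader keeps the incomplete final batch, which matches the behaviour described in this thread. The root cause of the failure was not established here, so this only reproduces the workaround.

```python
from typing import List, Sequence


def last_batch_size(num_samples: int, batch_size: int) -> int:
    """Size of the final batch when the incomplete remainder is kept."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    remainder = num_samples % batch_size
    # A zero remainder means the final batch is a full one.
    return remainder if remainder else batch_size


def trim_to_full_batches(sample_ids: Sequence[str], batch_size: int) -> List[str]:
    """Drop trailing samples so that every batch has exactly batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    usable = len(sample_ids) - len(sample_ids) % batch_size
    return list(sample_ids[:usable])


if __name__ == "__main__":
    # Hypothetical validation split with an odd number of images.
    val_ids = [f"{i:06d}" for i in range(7)]
    print("last batch before trimming:", last_batch_size(len(val_ids), 2))

    trimmed = trim_to_full_batches(val_ids, 2)
    print("images kept:", len(trimmed))
    print("last batch after trimming:", last_batch_size(len(trimmed), 2))
```

If the project builds its loader with PyTorch's `torch.utils.data.DataLoader`, passing `drop_last=True` discards the incomplete final batch without editing the split file. Either approach leaves a few validation images unevaluated, so reported metrics can differ slightly from a full evaluation.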

Abhinav Kumar | अभिनव कुमार · Answer 3 · Mon Jan 22 2024 04:13:56 GMT+0800 (China Standard Time)

Glad that you were able to find a good enough solution.