[BUG]: AttributeError: 'Tensor' object has no attribute 'append'
aleSuglia opened this issue
Hi,
I just spotted a bug in the training script run_img2txt_dist.py. Specifically, when running the code with multiple GPUs, the following exception is raised:
Traceback (most recent call last):
  File "vlp/run_guesswhat_dist.py", line 625, in <module>
    main()
  File "vlp/run_guesswhat_dist.py", line 543, in main
    vqa2_loss.append(ans_loss.item())
AttributeError: 'Tensor' object has no attribute 'append'
Unfortunately, this happens because at line https://github.com/LuoweiZhou/VLP/blob/master/vlp/run_img2txt_dist.py#L542 you're overriding vqa2_loss, which then becomes a torch.Tensor, so the append call at line 543 breaks. Changing line 542 to ans_loss = ans_loss.mean() should fix the error.
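To make the failure mode concrete, here is a minimal sketch of the pattern, not the actual VLP code. It assumes line 542 assigns the averaged loss to the name vqa2_loss, as described above. FakeLoss, buggy_step, and fixed_step are hypothetical names, and FakeLoss is only a pure-Python stand-in for a multi-GPU loss torch.Tensor, so the snippet runs without PyTorch.

```python
from statistics import fmean


class FakeLoss:
    """Hypothetical stand-in for a loss torch.Tensor holding one value per GPU."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        # Like Tensor.mean(): reduce to a single-element object.
        return FakeLoss([fmean(self.values)])

    def item(self):
        # Like Tensor.item(): only valid for a single-element object.
        if len(self.values) != 1:
            raise ValueError("only one-element losses can be converted to a scalar")
        return self.values[0]


def buggy_step(vqa2_loss, ans_loss):
    # Assumed shape of the bug: the list accumulator is rebound to the mean.
    vqa2_loss = ans_loss.mean()
    # The name no longer refers to a list, so .append raises AttributeError.
    vqa2_loss.append(ans_loss.item())
    return vqa2_loss


def fixed_step(vqa2_loss, ans_loss):
    # Proposed fix: average the per-GPU loss in place, keep the list intact.
    ans_loss = ans_loss.mean()
    vqa2_loss.append(ans_loss.item())
    return vqa2_loss


if __name__ == "__main__":
    print(fixed_step([], FakeLoss([0.5, 1.5])))
    try:
        buggy_step([], FakeLoss([0.5, 1.5]))
    except AttributeError as err:
        print("buggy version fails:", err)
```

With the fix, vqa2_loss stays a plain list of floats and each step appends one scalar to it.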
@aleSuglia Yes, it should be ans_loss = ans_loss.mean(). Will fix, thanks for the catch!
Note that this part of the code has never been executed because we are using distributed data parallel (see the example on COCO here). The code has not been tested with regular data parallel (i.e., n_gpu>1), which is slower than distributed data parallel. We'd suggest using the distributed one. If for some reason you prefer the regular one, please expect some rough edges and use it at your own discretion.
Thanks a lot for your answer @LuoweiZhou. Yeah that makes sense. Do I have to specify anything in particular to use multiple GPUs? By looking at the code it looks like I only need to make sure that the program is able to "see" multiple devices. Is that correct?
Yes, in the 2-GPU example, you can specify CUDA_VISIBLE_DEVICES=0,1 (comma-separated, with no space) for both commands if you want.
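If you launch the training commands from Python rather than from a shell, one way to set the variable is to build the environment explicitly. This is only a sketch: env_with_gpus is a hypothetical helper, not part of the VLP repository, and the command you pass to subprocess.run is whatever launch command you already use.

```python
import os


def env_with_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    Hypothetical helper: joins the GPU ids with commas and no spaces.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    env = env_with_gpus([0, 1])
    print(env["CUDA_VISIBLE_DEVICES"])
    # Pass env=env to subprocess.run(...) together with your own launch command.
```

The variable has to be set before the process initializes CUDA, which is why it goes into the child's environment rather than being changed after startup.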