gentaiscool / end2end-asr-pytorch

End-to-End Automatic Speech Recognition on PyTorch

Problem in Training on multi-gpu

paanguin opened this issue · comments

I saw that you fixed an issue in #6, but I don't think that is my problem.

I slightly modified your data-loading code for my convenience: instead of using labels stored as separate files, I changed it to read a single long list file (roughly as sketched below).
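To illustrate (this is only a rough, hypothetical sketch of my change, not the repository's actual loader; the manifest format and function name are assumptions), each line of the list file carries the audio path and transcript together instead of pointing to a separate label file:

```python
# Hypothetical sketch of the manifest-list loading described above
# (illustrative only; the '<wav_path>,<transcript>' line format is an assumption).
def load_manifest_list(manifest_path):
    """Read a manifest where each line is '<wav_path>,<transcript>'."""
    samples = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            wav_path, transcript = line.split(",", 1)
            samples.append((wav_path, transcript))
    return samples
```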

Anyway, the problem appeared during multi-GPU training.

The following training command with a single device (GPU) did not cause any problems.

CUDA_VISIBLE_DEVICES=0 python train.py --train-manifest-list ~/asr/data/librispeech/libri_asis --valid-manifest-list ~/asr/data/librispeech/libri_dev --test-manifest-list ~/asr/data/librispeech/libri_test_clean --labels-path data/labels/labels.json --cuda --device-ids 0 --parallel --save-every 1 --save-folder train_models/librispeech_transformer --name librispeech_transformer --warmup 8000 --epochs 10 --label-smoothing 0.15 --window-size 0.025 --window-stride 0.01 --window hann --lr 0.1 --feat_extractor None --num-layers 6 --dropout 0.2 --dim-inner 2048 --num-heads 8 --dim-input 201 --batch-size 4

However, when I run it with multiple GPUs, I get an error.

Here is the command:

CUDA_VISIBLE_DEVICES=0,1 python train.py --train-manifest-list ~/asr/data/librispeech/libri_asis --valid-manifest-list ~/asr/data/librispeech/libri_dev --test-manifest-list ~/asr/data/librispeech/libri_test_clean --labels-path data/labels/labels.json --cuda --device-ids 0 1 --parallel --save-every 1 --save-folder train_models/librispeech_transformer --name librispeech_transformer --warmup 8000 --epochs 10 --label-smoothing 0.15 --window-size 0.025 --window-stride 0.01 --window hann --lr 0.1 --feat_extractor None --num-layers 6 --dropout 0.2 --dim-inner 2048 --num-heads 8 --dim-input 201 --batch-size 8

and this is the error message:

==================================================
THE EXPERIMENT LOG IS SAVED IN: log/librispeech_transformer
TRAINING MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_asis']
VALID MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_dev']
TEST MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_test_clean']
==================================================
the model is initialized without feature extractor
load with device_ids [0, 1]
0%| | 0/35156 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 117, in
trainer.train(model, train_loader, train_sampler, valid_loader_list, opt, loss_type, start_epoch, num_epochs, label2id, id2label, metrics)
File "/home/hh1208-kang/end2end-asr-pytorch/trainer/asr/trainer.py", line 59, in train
src, src_lengths, tgt, verbose=False)
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/hh1208-kang/end2end-asr-pytorch/utils/parallel.py", line 147, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/hh1208-kang/end2end-asr-pytorch/utils/parallel.py", line 190, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/hh1208-kang/end2end-asr-pytorch/utils/parallel.py", line 146, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/hh1208-kang/end2end-asr-pytorch/utils/parallel.py", line 184, in replicate
return replicate(module, device_ids)
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

My machine has 4 GTX TITAN GPUs. I would be very grateful for any suggestions or advice on how to solve this problem.
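For reference, my understanding is that DataParallel-style replication (which utils/parallel.py appears to wrap) expects all model parameters and inputs to already live on device_ids[0] before they are broadcast to the other replicas; otherwise comm.broadcast_coalesced raises exactly this "all tensors must be on devices[0]" error. A minimal sketch of the placement that plain torch.nn.DataParallel expects (standard PyTorch usage, not this repository's code):

```python
# Minimal sketch of the device placement nn.DataParallel expects
# (standard PyTorch usage; not taken from this repository's utils/parallel.py).
import torch
import torch.nn as nn

device_ids = [0, 1]
model = nn.Linear(201, 2048)               # stand-in for the transformer model

# Parameters must sit on device_ids[0] before replication; if any tensor is
# on another device, broadcast_coalesced raises
# "all tensors must be on devices[0]".
model = model.cuda(device_ids[0])
model = nn.DataParallel(model, device_ids=device_ids)

x = torch.randn(8, 201).cuda(device_ids[0])  # inputs also start on devices[0]
y = model(x)
print(y.shape)
```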

Thanks for the report. I am going to check the code.

Same error here with

RuntimeError: all tensors must be on devices[0]

@paanguin @AIscientist: We have figured out the issue and are currently fixing the code. We will get back to you in a few days.

@AIscientist @paanguin We have pushed the correction to the develop branch. We will review the code and merge the branch into master soon.

@AIscientist @paanguin We have updated the code on the master branch; kindly pull the latest code on the `master` branch.