Can't train with fp16 on Nvidia P100
54696d21 opened this issue · comments
training with fp16 doesn't work for me on a P100, I'll look into fixing it, but for future reference here is the full stacktrace
torch version 1.9.0
2021-06-29 10:29:09.537741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at /pytorch/aten/src/ATen/native/SpectralOps.cpp:664.)
normalized, onesided, return_complex)
Traceback (most recent call last):
File "train_ms.py", line 294, in <module>
main()
File "train_ms.py", line 50, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/content/vits/train_ms.py", line 118, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/content/vits/train_ms.py", line 192, in train_and_evaluate
scaler.scale(loss_gen_all).backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: "fill_cuda" not implemented for 'ComplexHalf'
the problem does not occur on torch version 1.6.0 as in the requirements.txt
Use nvcr.io/nvidia/pytorch:20.07-py
docker image.
I have the same issue, any idea of where the complex number is generated?,
(for 1.6 it works fine but I want to combine this with another code that requires pytorch 1.9)
I got the same issue. It’s due to a bug in the pytorch STFT function for half tensor. The work around is moving the calculation of y_hat_mel in train.py outside autocast, and casting y_hat to float one line above y_hat_mel calculation.
@boltzmann-Li Can you create a PR so we can see that fix. I haven't managed to get it working following your instructions.
FYI the problem hasn't been fixed in torch 1.10.0
Is there an issue for the Complex Half problem?
@boltzmann-Li Can you create a PR so we can see that fix. I haven't managed to get it working following your instructions.
FYI the problem hasn't been fixed in torch 1.10.0
Is there an issue for the Complex Half problem?
I created a pull request. It has been working for me with 3090 GPUs and torch 1.9
Very helpful @boltzmann-Li
here are the lines https://github.com/boltzmann-Li/vits/blob/5a1f4b7afb8a822f66c0ddc75bc959a44a57d035/train_ms.py#L156-L166
I think a better way to solve this problem is to wrap the torch.stft with autocast(enabled=off)
inside the mel_spectrogram_torch function. Here is the code:
def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
if torch.min(y) < -1.:
print('min value is ', torch.min(y))
if torch.max(y) > 1.:
print('max value is ', torch.max(y))
global mel_basis, hann_window
dtype_device = str(y.dtype) + '_' + str(y.device)
fmax_dtype_device = str(fmax) + '_' + dtype_device
wnsize_dtype_device = str(win_size) + '_' + dtype_device
if fmax_dtype_device not in mel_basis:
mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device)
if wnsize_dtype_device not in hann_window:
hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
y = y.squeeze(1)
with autocast(enabled=False):
y = y.float()
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
center=center, pad_mode='reflect', normalized=False, onesided=True)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
spec = spectral_normalize_torch(spec)
return spec