Can't train with fp16 on Nvidia P100

Question

54696d21 opened this issue 3 years ago · comments

Training with fp16 doesn't work for me on a P100. I'll look into fixing it, but for future reference here is the full stack trace.
Torch version: 1.9.0

2021-06-29 10:29:09.537741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /pytorch/aten/src/ATen/native/SpectralOps.cpp:664.)
  normalized, onesided, return_complex)
Traceback (most recent call last):
  File "train_ms.py", line 294, in <module>
    main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/vits/train_ms.py", line 118, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits/train_ms.py", line 192, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: "fill_cuda" not implemented for 'ComplexHalf'

Tim · Answer 1 · Tue Jun 29 2021 18:41:54 GMT+0800 (China Standard Time)

The problem does not occur on torch 1.6.0, the version listed in requirements.txt.

sleim · Answer 2 · Wed Jun 30 2021 16:37:00 GMT+0800 (China Standard Time)

Use nvcr.io/nvidia/pytorch:20.07-py docker image.

Jesus Villalba · Answer 3 · Wed Aug 04 2021 00:54:49 GMT+0800 (China Standard Time)

I have the same issue. Any idea where the complex number is generated?
(It works fine with 1.6, but I want to combine this with other code that requires PyTorch 1.9.)

boltzmann-Li · Answer 4 · Tue Oct 12 2021 11:08:58 GMT+0800 (China Standard Time)

I got the same issue. It's due to a bug in the PyTorch STFT function for half tensors. The workaround is to move the calculation of y_hat_mel in train.py outside the autocast block and cast y_hat to float one line above the y_hat_mel calculation.
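
A minimal sketch of that workaround, assuming a generator `net_g` that returns a waveform of shape (batch, 1, samples). `mel_like` and `generator_step` are hypothetical stand-ins, not the repository's functions, and the sketch passes `return_complex=True`, which the code in this thread does not:

```python
import torch
from torch.cuda.amp import autocast


def mel_like(y, n_fft=64, hop=16):
    # Hypothetical stand-in for mel_spectrogram_torch: magnitude STFT only.
    window = torch.hann_window(n_fft, dtype=y.dtype, device=y.device)
    spec = torch.stft(y, n_fft, hop_length=hop, window=window,
                      center=False, return_complex=True)
    return spec.abs()


def generator_step(net_g, x, use_fp16):
    # The forward pass may run in fp16 when autocast is active on a CUDA device.
    with autocast(enabled=use_fp16):
        y_hat = net_g(x)
    # Outside autocast: cast to float32 first so the STFT never sees half input.
    y_hat = y_hat.float()
    y_hat_mel = mel_like(y_hat.squeeze(1))
    return y_hat, y_hat_mel
```

Losses that use y_hat_mel would then be computed from a float32 tensor, while the rest of the forward pass can stay inside autocast.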

Harry Coultas Blum · Answer 5 · Mon Oct 25 2021 19:16:18 GMT+0800 (China Standard Time)

@boltzmann-Li Can you create a PR so we can see that fix? I haven't managed to get it working following your instructions.

FYI, the problem hasn't been fixed in torch 1.10.0.

Is there an issue for the ComplexHalf problem?

boltzmann-Li · Answer 6 · Thu Oct 28 2021 20:46:10 GMT+0800 (China Standard Time)

I created a pull request. It has been working for me with 3090 GPUs and torch 1.9.

Faris Hijazi · Answer 7 · Fri Dec 31 2021 00:12:06 GMT+0800 (China Standard Time)

Very helpful, @boltzmann-Li.

Here are the lines: https://github.com/boltzmann-Li/vits/blob/5a1f4b7afb8a822f66c0ddc75bc959a44a57d035/train_ms.py#L156-L166

Yunchao He · Answer 8 · Thu Mar 31 2022 14:27:08 GMT+0800 (China Standard Time)

I think a better way to solve this problem is to wrap the torch.stft call in autocast(enabled=False) inside the mel_spectrogram_torch function. Here is the code:

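import torch
from torch.cuda.amp import autocast
from librosa.filters import mel as librosa_mel_fn

# mel_basis, hann_window and spectral_normalize_torch are defined elsewhere in the same module.
def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):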
    if torch.min(y) < -1.:
        print('min value is ', torch.min(y))
    if torch.max(y) > 1.:
        print('max value is ', torch.max(y))

    global mel_basis, hann_window
    dtype_device = str(y.dtype) + '_' + str(y.device)
    fmax_dtype_device = str(fmax) + '_' + dtype_device
    wnsize_dtype_device = str(win_size) + '_' + dtype_device
    if fmax_dtype_device not in mel_basis:
        mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
        mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device)
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)

    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
    y = y.squeeze(1)
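    # Run the STFT in full precision so it never receives a half tensor,
    # which is what leads to the ComplexHalf error in the stack trace above.
    with autocast(enabled=False):
        y = y.float()
        spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
                        center=center, pad_mode='reflect', normalized=False, onesided=True)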

    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)

    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
    spec = spectral_normalize_torch(spec)

    return spec
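
The UserWarning in the original stack trace says a future PyTorch release will require return_complex=True. A minimal sketch of how the STFT and magnitude steps above might look in that case, assuming `torch.view_as_real` is used to recover the trailing real/imaginary dimension that `spec.pow(2).sum(-1)` expects. `magnitude_spectrogram` is a hypothetical name, and the window cache and mel projection are left out:

```python
import torch
from torch.cuda.amp import autocast


def magnitude_spectrogram(y, n_fft, hop_size, win_size, center=False):
    # Hypothetical helper: only the STFT and magnitude steps, no mel projection.
    with autocast(enabled=False):
        y = y.float()  # keep the STFT in float32 even if y arrived as half
        window = torch.hann_window(win_size, dtype=torch.float32, device=y.device)
        spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                          window=window, center=center, pad_mode='reflect',
                          normalized=False, onesided=True, return_complex=True)
        # Complex (..., freq, frames) -> real (..., freq, frames, 2)
        spec = torch.view_as_real(spec)
        return torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
```

The result has shape (batch, n_fft // 2 + 1, frames) and dtype float32, so the mel_basis matmul and spectral_normalize_torch steps could follow it as in the function above.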