ZhikangNiu / encodec-pytorch

Unofficial implementation of High Fidelity Neural Audio Compression (EnCodec)


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

sjjbsj opened this issue · comments

commented

Hi,

I am writing to seek your advice on an issue I am experiencing during backpropagation of my model. Specifically, I am encountering an error in the loss backward pass after the warmup stage and am unsure how to proceed; it seems to happen once the code enters the branch: if config.model.train_discriminator and epoch > config.lr_scheduler.warmup_epoch:

I would greatly appreciate any guidance or suggestions you may have to help me address this problem.

log:
Error executing job with overrides: ['distributed.torch_distributed_debug=False', 'distributed.find_unused_parameters=True', 'distributed.world_size=2', 'common.max_epoch=15', 'datasets.tensor_cut=8000', 'datasets.batch_size=40', 'datasets.train_csv_path=/home/anna.peng/PycharmProjects/encodec-pytorch-main/librispeech_train100h_anna.csv', 'lr_scheduler.warmup_epoch=2', 'optimization.lr=1e-4', 'optimization.disc_lr=1e-4']
Traceback (most recent call last):
File "train_multi_gpu.py", line 258, in main
join=True
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/anna.peng/PycharmProjects/encodec-pytorch-main/train_multi_gpu.py", line 209, in train
scheduler,disc_scheduler)
File "/home/anna.peng/PycharmProjects/encodec-pytorch-main/train_multi_gpu.py", line 59, in train_one_step
loss.backward()
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/autograd/init.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
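For context, here is a minimal standalone sketch (illustrative only, not code from this repository) that triggers the same class of error: an optimizer step modifies parameters in place while a pending backward pass still needs the values saved during the forward pass. The exact tensor shape and version numbers will differ from the log above.

    import torch
    import torch.nn as nn

    disc = nn.Conv2d(1, 32, 3)                               # stands in for one discriminator layer
    gen_out = torch.randn(1, 1, 16, 16, requires_grad=True)  # stands in for the generator output

    logits = disc(gen_out)                    # the forward pass saves disc.weight for backward
    loss_disc = logits.mean()
    loss_disc.backward(retain_graph=True)     # "discriminator" backward; keep the shared graph alive

    opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-3)
    opt_disc.step()                           # in-place parameter update bumps the version counter

    loss_g = logits.mean()                    # "generator" loss built on the same graph
    loss_g.backward()                         # RuntimeError: ... modified by an inplace operation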

emmm, I have also encountered this problem. Could you check whether your code is the latest? Besides, I will look into the problem tomorrow. You can also use an older version by checking the commit history; the problem may be caused by my recent changes to the scheduler.

Can you share your environment?

I think you can debug step by step to understand the specific problem.

I have used the latest code on two RTX 3090s to test multi-GPU training and on one RTX 3090 to test single-GPU training, and I did not encounter this bug. Here is my setup.

  • Multi GPU training
    3090 x2, torch==2.0.0; the changed config parameters are listed as follows:
     distributed.torch_distributed_debug=False
     distributed.find_unused_parameters=True
     distributed.world_size=2
     common.save_interval=5
     common.test_interval=2
     common.max_epoch=100
     datasets.fixed_length=1000
     datasets.train_csv_path=/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv
     datasets.tensor_cut=100000
     datasets.batch_size=6
     lr_scheduler.warmup_epoch=2
     optimization.lr=1e-5
     optimization.disc_lr=1e-5

The log:

2023-06-02 12:48:15,326: INFO: [train_multi_gpu.py: 119]: {'common': {'save_interval': 5, 'test_interval': 2, 'max_epoch': 100, 'seed': 3401, 'amp': False}, 'datasets': {'train_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv', 'test_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/LibriTTS_dev-other.csv', 'batch_size': 6, 'tensor_cut': 100000, 'num_workers': 0, 'fixed_length': 1000, 'pin_memory': True}, 'checkpoint': {'resume': False, 'checkpoint_path': '', 'disc_checkpoint_path': '', 'save_folder': './checkpoints/', 'save_location': '${checkpoint.save_folder}batch${datasets.batch_size}_cut${datasets.tensor_cut}_length${datasets.fixed_length}_'}, 'optimization': {'lr': 1e-05, 'disc_lr': 1e-05}, 'lr_scheduler': {'warmup_epoch': 2}, 'model': {'target_bandwidths': [1.5, 3.0, 6.0, 12.0, 24.0], 'sample_rate': 24000, 'channels': 1, 'train_discriminator': True, 'audio_normalize': True, 'filters': 32}, 'distributed': {'data_parallel': True, 'world_size': 2, 'find_unused_parameters': True, 'torch_distributed_debug': False}}
2023-06-02 12:48:15,331: INFO: [train_multi_gpu.py: 120]: Encodec Model Parameters: 14855843
2023-06-02 12:48:15,331: INFO: [train_multi_gpu.py: 121]: Disc Model Parameters: 283398
2023-06-02 12:48:15,331: INFO: [train_multi_gpu.py: 122]: model train mode :True | quantizer train mode :True 
2023-06-02 12:48:15,593: INFO: [train_multi_gpu.py: 119]: {'common': {'save_interval': 5, 'test_interval': 2, 'max_epoch': 100, 'seed': 3401, 'amp': False}, 'datasets': {'train_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv', 'test_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/LibriTTS_dev-other.csv', 'batch_size': 6, 'tensor_cut': 100000, 'num_workers': 0, 'fixed_length': 1000, 'pin_memory': True}, 'checkpoint': {'resume': False, 'checkpoint_path': '', 'disc_checkpoint_path': '', 'save_folder': './checkpoints/', 'save_location': '${checkpoint.save_folder}batch${datasets.batch_size}_cut${datasets.tensor_cut}_length${datasets.fixed_length}_'}, 'optimization': {'lr': 1e-05, 'disc_lr': 1e-05}, 'lr_scheduler': {'warmup_epoch': 2}, 'model': {'target_bandwidths': [1.5, 3.0, 6.0, 12.0, 24.0], 'sample_rate': 24000, 'channels': 1, 'train_discriminator': True, 'audio_normalize': True, 'filters': 32}, 'distributed': {'data_parallel': True, 'world_size': 2, 'find_unused_parameters': True, 'torch_distributed_debug': False}}
2023-06-02 12:48:15,594: INFO: [train_multi_gpu.py: 120]: Encodec Model Parameters: 14855843
2023-06-02 12:48:15,595: INFO: [train_multi_gpu.py: 121]: Disc Model Parameters: 283398
2023-06-02 12:48:15,595: INFO: [train_multi_gpu.py: 122]: model train mode :True | quantizer train mode :True 
2023-06-02 12:48:15,708: INFO: [distributed_c10d.py: 432]: Added key: store_based_barrier_key:1 to store for rank: 1
2023-06-02 12:48:15,716: INFO: [distributed_c10d.py: 432]: Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-02 12:48:15,717: INFO: [distributed_c10d.py: 466]: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-02 12:48:15,733: INFO: [distributed_c10d.py: 466]: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-02 12:49:27,759: INFO: [train_multi_gpu.py: 65]: | epoch: 1 | loss: 33.31180191040039 | loss_g: 31.2178897857666 | loss_w: 2.093912363052368 | lr: 1.0000000000000001e-07 | disc_lr: 1.0000000000000001e-07
2023-06-02 12:50:30,795: INFO: [train_multi_gpu.py: 65]: | epoch: 2 | loss: 18.22700309753418 | loss_g: 17.53032684326172 | loss_w: 0.6966761350631714 | lr: 9.990754267376514e-06 | disc_lr: 9.990754267376514e-06
2023-06-02 12:51:12,410: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 2 | loss_g: 16.217060089111328 | loss_disc: 1.999979853630066
2023-06-02 12:52:35,755: INFO: [train_multi_gpu.py: 65]: | epoch: 3 | loss: 11.52556324005127 | loss_g: 11.375588417053223 | loss_w: 0.1499750018119812 | lr: 9.979206008271393e-06 | disc_lr: 9.979206008271393e-06
2023-06-02 12:52:35,756: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9994720220565796
2023-06-02 12:53:58,440: INFO: [train_multi_gpu.py: 65]: | epoch: 4 | loss: 12.130687713623047 | loss_g: 12.015471458435059 | loss_w: 0.11521672457456589 | lr: 9.963055062204609e-06 | disc_lr: 9.963055062204609e-06
2023-06-02 12:53:58,440: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9973113536834717
2023-06-02 12:54:26,933: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 4 | loss_g: 10.545166969299316 | loss_disc: 1.9967026710510254
2023-06-02 12:55:49,203: INFO: [train_multi_gpu.py: 65]: | epoch: 5 | loss: 10.983535766601562 | loss_g: 10.894118309020996 | loss_w: 0.08941741287708282 | lr: 9.942318025365027e-06 | disc_lr: 9.942318025365027e-06
2023-06-02 12:55:49,204: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9918556213378906
2023-06-02 12:57:13,010: INFO: [train_multi_gpu.py: 65]: | epoch: 6 | loss: 10.523397445678711 | loss_g: 10.419259071350098 | loss_w: 0.10413823276758194 | lr: 9.917016206459796e-06 | disc_lr: 9.917016206459796e-06
2023-06-02 12:57:13,011: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9791228771209717
2023-06-02 12:57:40,469: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 6 | loss_g: 9.559894561767578 | loss_disc: 1.9766931533813477
2023-06-02 12:59:04,421: INFO: [train_multi_gpu.py: 65]: | epoch: 7 | loss: 9.345637321472168 | loss_g: 9.30517292022705 | loss_w: 0.040464136749506 | lr: 9.887175604818207e-06 | disc_lr: 9.887175604818207e-06
2023-06-02 12:59:04,422: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9636516571044922
2023-06-02 13:00:27,272: INFO: [train_multi_gpu.py: 65]: | epoch: 8 | loss: 9.28451919555664 | loss_g: 9.215991973876953 | loss_w: 0.06852763891220093 | lr: 9.852826883675634e-06 | disc_lr: 9.852826883675634e-06
2023-06-02 13:00:27,273: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9527044296264648
2023-06-02 13:00:54,698: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 8 | loss_g: 8.622124671936035 | loss_disc: 1.9357608556747437
2023-06-02 13:02:17,948: INFO: [train_multi_gpu.py: 65]: | epoch: 9 | loss: 6.146459579467773 | loss_g: 6.11637544631958 | loss_w: 0.03008396551012993 | lr: 9.814005338664973e-06 | disc_lr: 9.814005338664973e-06
2023-06-02 13:02:17,949: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9932515621185303
2023-06-02 13:03:42,270: INFO: [train_multi_gpu.py: 65]: | epoch: 10 | loss: 6.795133113861084 | loss_g: 6.749742031097412 | loss_w: 0.04539122059941292 | lr: 9.77075086154801e-06 | disc_lr: 9.77075086154801e-06
2023-06-02 13:03:42,271: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9943803548812866
2023-06-02 13:04:09,624: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 10 | loss_g: 6.215439796447754 | loss_disc: 1.9836242198944092
  • Single GPU training
    one RTX 3090, torch==2.0.0
     distributed.data_parallel=False
     common.save_interval=5
     common.test_interval=2
     common.max_epoch=100
     datasets.fixed_length=1000
     datasets.train_csv_path=/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv
     datasets.tensor_cut=100000
     datasets.batch_size=6
     lr_scheduler.warmup_epoch=2
     optimization.lr=1e-5
     optimization.disc_lr=1e-5

The log:

2023-06-02 13:08:02,786: INFO: [train_multi_gpu.py: 119]: {'common': {'save_interval': 5, 'test_interval': 2, 'max_epoch': 100, 'seed': 3401, 'amp': False}, 'datasets': {'train_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv', 'test_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/LibriTTS_dev-other.csv', 'batch_size': 6, 'tensor_cut': 100000, 'num_workers': 0, 'fixed_length': 1000, 'pin_memory': True}, 'checkpoint': {'resume': False, 'checkpoint_path': '', 'disc_checkpoint_path': '', 'save_folder': './checkpoints/', 'save_location': '${checkpoint.save_folder}batch${datasets.batch_size}_cut${datasets.tensor_cut}_length${datasets.fixed_length}_'}, 'optimization': {'lr': 1e-05, 'disc_lr': 1e-05}, 'lr_scheduler': {'warmup_epoch': 2}, 'model': {'target_bandwidths': [1.5, 3.0, 6.0, 12.0, 24.0], 'sample_rate': 24000, 'channels': 1, 'train_discriminator': True, 'audio_normalize': True, 'filters': 32}, 'distributed': {'data_parallel': False, 'world_size': 4, 'find_unused_parameters': True, 'torch_distributed_debug': False}}
2023-06-02 13:08:02,791: INFO: [train_multi_gpu.py: 120]: Encodec Model Parameters: 14855843
2023-06-02 13:08:02,792: INFO: [train_multi_gpu.py: 121]: Disc Model Parameters: 283398
2023-06-02 13:08:02,792: INFO: [train_multi_gpu.py: 122]: model train mode :True | quantizer train mode :True 
2023-06-02 13:10:11,058: INFO: [train_multi_gpu.py: 65]: | epoch: 1 | loss: 35.07368850708008 | loss_g: 33.48434066772461 | loss_w: 1.5893492698669434 | lr: 1.0000000000000001e-07 | disc_lr: 1.0000000000000001e-07
2023-06-02 13:12:14,356: INFO: [train_multi_gpu.py: 65]: | epoch: 2 | loss: 8.310249328613281 | loss_g: 8.114068984985352 | loss_w: 0.1961808204650879 | lr: 9.990754267376514e-06 | disc_lr: 9.990754267376514e-06
2023-06-02 13:13:08,171: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 2 | loss_g: 12.444842338562012 | loss_disc: 1.999990463256836
2023-06-02 13:15:45,271: INFO: [train_multi_gpu.py: 65]: | epoch: 3 | loss: 13.03136157989502 | loss_g: 12.901001930236816 | loss_w: 0.13035933673381805 | lr: 9.979206008271393e-06 | disc_lr: 9.979206008271393e-06
2023-06-02 13:15:45,272: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9959542751312256
2023-06-02 13:18:24,182: INFO: [train_multi_gpu.py: 65]: | epoch: 4 | loss: 10.319283485412598 | loss_g: 10.254348754882812 | loss_w: 0.06493447721004486 | lr: 9.963055062204609e-06 | disc_lr: 9.963055062204609e-06
2023-06-02 13:18:24,182: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9799656867980957
2023-06-02 13:19:17,341: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 4 | loss_g: 8.972946166992188 | loss_disc: 1.9816977977752686
2023-06-02 13:21:53,971: INFO: [train_multi_gpu.py: 65]: | epoch: 5 | loss: 7.428076267242432 | loss_g: 7.343043804168701 | loss_w: 0.08503223955631256 | lr: 9.942318025365027e-06 | disc_lr: 9.942318025365027e-06
2023-06-02 13:21:53,973: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9636757373809814

My best suggestion is to make sure your torch version is the same as mine and to use either the latest code or an older commit from before I added the WarmupScheduler; maybe something changed between torch 1.x and torch 2.x. Good luck, and I hope this helps @sjjbsj

If you have any further questions, please let me know; otherwise I will close this issue.

commented

Thank you for your reply. I tried not using the newest WarmupScheduler optimizer, but I still get this error.
Sadly, my CUDA version is 11.2, so torch 2.0.0 cannot be installed. Can you point me toward anything else to check, for example the stft function you mentioned?


Maybe you can add more information about your environment; that would help me test the code.

commented

Environment:
python3.7, torch1.13, two T4 GPUs
Command:
CUDA_VISIBLE_DEVICES=0,1 python3 train_multi_gpu.py
distributed.torch_distributed_debug=True
distributed.find_unused_parameters=True
distributed.world_size=2
common.max_epoch=10
datasets.tensor_cut=8000
datasets.batch_size=32
datasets.train_csv_path=/home/anna.peng/PycharmProjects/encodec-pytorch-main/librispeech_train100h.csv
lr_scheduler.warmup_epoch=2
optimization.lr=1e-4
optimization.disc_lr=1e-4
Some DDP info:
[I logger.cpp:213] [Rank 0]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 1
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 160
output_device: 0
rank: 0
total_parameter_size_bytes: 59423372
world_size: 2
backend_name: nccl
bucket_sizes: 2671180, 27041280, 27836928, 1873984
cuda_visible_devices: 0,1
device_ids: 0
dtypes: float
master_addr: localhost
master_port: 12455
module_name: EncodecModel
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: N/A
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I logger.cpp:213] [Rank 1]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 1
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 160
output_device: 1
rank: 1
total_parameter_size_bytes: 59423372
world_size: 2
backend_name: nccl
bucket_sizes: 2671180, 27041280, 27836928, 1873984
cuda_visible_devices: 0,1
device_ids: 1
dtypes: float
master_addr: localhost
master_port: 12455
module_name: EncodecModel
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: N/A
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I reducer.cpp:126] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I reducer.cpp:126] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:213] [Rank 1]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 1
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 51
output_device: 1
rank: 1
total_parameter_size_bytes: 1133592
world_size: 2
backend_name: nccl
bucket_sizes: 38280, 1095312
cuda_visible_devices: 0,1
device_ids: 1
dtypes: float
master_addr: localhost
master_port: 12455
module_name: MultiScaleSTFTDiscriminator
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: N/A
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I logger.cpp:213] [Rank 0]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 1
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 51
output_device: 0
rank: 0
total_parameter_size_bytes: 1133592
world_size: 2
backend_name: nccl
bucket_sizes: 38280, 1095312
cuda_visible_devices: 0,1
device_ids: 0
dtypes: float
master_addr: localhost
master_port: 12455
module_name: MultiScaleSTFTDiscriminator
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: N/A
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

When the epoch is greater than lr_scheduler.warmup_epoch, the above error will be thrown. Please let me know if I am missing any information.

emmm, I will set up the same environment as yours to test the code. Maybe you can test with commit 56de473b9e1981b36e7a9276034af70e9036b4fb? At that commit, I used two separate schedulers for training.

Besides, I think you can debug step by step to pin down the specific problem; that would help us solve it. @sjjbsj
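One generic way to do that (a standard PyTorch debugging aid, not something specific to this repo) is to enable autograd anomaly detection; the backward error then also prints the forward-pass traceback of the operation whose saved tensor was modified in place. A self-contained sketch of the idea:

    import torch
    import torch.nn as nn

    torch.autograd.set_detect_anomaly(True)   # slows training noticeably; enable only while debugging

    layer = nn.Linear(8, 8)
    x = torch.randn(2, 8, requires_grad=True)
    out = layer(x)

    with torch.no_grad():
        layer.weight.add_(1.0)                # simulate an optimizer step's in-place parameter update

    out.sum().backward()                      # the error now also reports where out was created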

Hello @sjjbsj! I tested the code with torch 1.13.0 and hit the same bug as you. The root cause seems to be that the generator's loss.backward() still needs the discriminator parameters that produced logits_fake, so the discriminator must not be updated in place before that backward runs. I suggest stepping the generator first and then re-running the discriminator on output.detach(), like this:

    model.train()
    disc_model.train()
    for input_wav in tqdm(trainloader):
        # warmup learning rate, warmup_epoch is defined in config file,default is 5
        input_wav = input_wav.cuda() #[B, 1, T]: eg. [2, 1, 203760]
        optimizer.zero_grad()
        optimizer_disc.zero_grad()
        output, loss_w, _ = model(input_wav) #output: [B, 1, T]: eg. [2, 1, 203760] | loss_w: [1] 
        logits_real, fmap_real = disc_model(input_wav)
        logits_fake, fmap_fake = disc_model(output)
        loss_g = total_loss(fmap_real, logits_fake, fmap_fake, input_wav, output) 
        loss = loss_g + loss_w
        loss.backward()
        optimizer.step()
        scheduler.step()
        # train discriminator when epoch > warmup_epoch and train_discriminator is True
        if config.model.train_discriminator and epoch > config.lr_scheduler.warmup_epoch:
            logits_fake, _ = disc_model(output.detach()) # detach to avoid backpropagation to model
            loss_disc = disc_loss([logit_real.detach() for logit_real in logits_real], logits_fake) # compute discriminator loss
            loss_disc.backward() 
            optimizer_disc.step()
            disc_scheduler.step()
        

You can also find a discussion of this problem in NVlabs/FUNIT#23
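As a quick sanity check (with stand-in modules, not the repo's classes), detaching the generator output before the discriminator pass keeps the discriminator update from reaching back into the generator's graph:

    import torch
    import torch.nn as nn

    gen = nn.Linear(8, 8)    # stands in for the generator
    disc = nn.Linear(8, 1)   # stands in for the discriminator

    output = gen(torch.randn(4, 8))
    loss_disc = disc(output.detach()).mean()  # discriminator loss on the detached generator output
    loss_disc.backward()

    print(gen.weight.grad)                    # None: no gradient flowed into the generator
    print(disc.weight.grad is not None)       # True: the discriminator still receives gradients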

I will test this code's results; if it performs well, I will merge the issue6 branch into main.

commented

Thanks, the problem has been successfully resolved!