Bug running fcn with mtat

Question

Sette opened this issue 3 years ago

result = self.forward(*input, **kwargs)
RuntimeError: builtins: link error: Invalid value
The above operation failed in interpreter, with the following stack trace:

The above operation failed in interpreter, with the following stack trace:

Any idea what the problem is?

Minz Won · Answer 1 · Wed Jan 05 2022 00:12:27 GMT+0800 (China Standard Time)

Can you share the entire code that you ran and the full error message, please? From this snippet alone, I can't tell which part raised the error.

Bruno Sette · Answer 2 · Wed Jan 05 2022 00:48:14 GMT+0800 (China Standard Time)

Namespace(batch_size=16, data_path='/home/bruno/data', dataset='mtat', log_step=20, lr=0.0001, model_load_path='.', model_save_path='./../models', model_type='fcn', n_epochs=200, num_workers=0, use_tensorboard=1)
Traceback (most recent call last):
File "main.py", line 59, in <module>
main(config)
File "main.py", line 37, in main
solver.train()
File "/home/bruno/git/sota-music-tagging-models/training/solver.py", line 169, in train
out = self.model(x)
File "/home/bruno/anaconda3/envs/sota/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/bruno/git/sota-music-tagging-models/training/model.py", line 51, in forward
x = self.to_db(x)
File "/home/bruno/anaconda3/envs/sota/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
RuntimeError: builtins: link error: Invalid value
The above operation failed in interpreter, with the following stack trace:

The above operation failed in interpreter, with the following stack trace:

Minz Won · Answer 3 · Wed Jan 05 2022 00:58:05 GMT+0800 (China Standard Time)

Can you add these three lines in solver.py before out = self.model(x) in line 169?

print(x.shape)
print(type(x))
print(type(x[0][0][0]))

Then please share what they print. It looks like an input error.
Also, please double-check that your library versions match those in requirements.txt.
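
One way to do the version check suggested above is to compare the `name==version` pins against what is installed. The sketch below is only illustrative: `parse_pins` and `find_mismatches` are hypothetical helper names, the `torchaudio` pin and both "installed" versions are made-up sample values, and only the `torch==1.2.0` pin comes from this thread.

```python
import re


def parse_pins(requirements_text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.match(r"^([A-Za-z0-9_.\-]+)==([^\s;]+)", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for every pin not met exactly."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Sample data: only the torch pin is taken from the thread, the rest is assumed.
sample_requirements = "torch==1.2.0\ntorchaudio==0.3.0  # assumed pin\n"
sample_installed = {"torch": "1.10.1", "torchaudio": "0.3.0"}

mismatches = find_mismatches(parse_pins(sample_requirements), sample_installed)
print(mismatches)
```

On Python 3.8 or newer, the `installed` dictionary could be filled with `importlib.metadata.version(name)` for each pinned name; on older interpreters, the output of `pip freeze` would serve the same purpose.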

Bruno Sette · Answer 4 · Wed Jan 05 2022 01:09:04 GMT+0800 (China Standard Time)

Output:
torch.Size([16, 464000])
<class 'torch.Tensor'>
Traceback (most recent call last):
File "main.py", line 59, in <module>
main(config)
File "main.py", line 37, in main
solver.train()
File "/home/bruno/git/sota-music-tagging-models/training/solver.py", line 171, in train
print(type(x[0][0][0]))
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

About the requirements: I get an error when installing PyTorch 1.2:
ERROR: Could not find a version that satisfies the requirement torch==1.2.0 (from -r requirements.txt (line 20)) (from versions: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1)
ERROR: No matching distribution found for torch==1.2.0 (from -r requirements.txt (line 20))
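
The IndexError above is a side effect of the diagnostic, not the original bug: at that point x has shape [16, 464000], so x[0][0] is already a 0-dim tensor and cannot be indexed again. A minimal sketch of a diagnostic that does not assume a number of dimensions follows. `describe` is a hypothetical helper name, and it relies only on the `.shape` and `.dtype` attributes that `torch.Tensor` exposes.

```python
def describe(x):
    """Summarise an array-like object without indexing into it."""
    shape = tuple(getattr(x, "shape", ()))  # torch.Size converts to a plain tuple
    return {
        "type": type(x).__name__,
        "shape": shape,
        "ndim": len(shape),
        "dtype": str(getattr(x, "dtype", None)),
    }


# Possible use inside the training loop, in place of the three prints:
# print(describe(x))
```

Because it never indexes x, the same call should work both on the 2-D batch in solver.py and on the 3-D input reaching `self.to_db` in model.py.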

Minz Won · Answer 5 · Wed Jan 05 2022 01:14:23 GMT+0800 (China Standard Time)

Okay, then remove those three lines from solver.py. Instead, paste them into model.py at line 51, just before x = self.to_db(x). What do they print?

Yeah, maybe 1.2.0 is too old. What is the version of your torchaudio?

Bruno Sette · Answer 6 · Wed Jan 05 2022 01:22:21 GMT+0800 (China Standard Time)

Output:
torch.Size([16, 96, 1813])
<class 'torch.Tensor'>
<class 'torch.Tensor'>

torchaudio version is 0.3.0

Minz Won · Answer 7 · Wed Jan 05 2022 01:35:15 GMT+0800 (China Standard Time)

Okay, the input shape looks fine.
I suspect two other possible causes.

1. Check whether the input contains Inf or NaN. Remove the previous three lines and paste the following:
print(np.isnan(x).any())
print(np.isinf(x).any())
2. Does this happen regardless of whether you use the CPU or the GPU? An invalid value error is sometimes caused by the CUDA configuration, so please check whether it also happens on the CPU.
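
The same NaN/Inf check can also be done without leaving PyTorch, which avoids converting a tensor to NumPy first. This is a minimal sketch, assuming x is a floating-point `torch.Tensor`; `check_finite` is a hypothetical helper name, not part of the repository.

```python
import torch


def check_finite(x):
    """Return (has_nan, has_inf) as Python bools for a floating-point tensor."""
    has_nan = bool(torch.isnan(x).any())
    has_inf = bool(torch.isinf(x).any())
    return has_nan, has_inf


# Small demonstration on a hand-made tensor.
demo = torch.tensor([0.5, float("nan"), 2.0])
print(check_finite(demo))
```

Calling `print(check_finite(x))` just before `x = self.to_db(x)` would report both conditions in one line.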

Bruno Sette · Answer 8 · Wed Jan 05 2022 02:56:10 GMT+0800 (China Standard Time)

How can I run it on the CPU?

Minz Won · Answer 9 · Wed Jan 05 2022 09:26:04 GMT+0800 (China Standard Time)

You can control it in solver.py.

x.cpu() will send your input to the CPU and self.model.cpu() will send your model to the CPU. Try them at line 165.
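
One detail worth keeping in mind when trying this: `Tensor.cpu()` returns a tensor on the CPU rather than moving x in place, so the result has to be assigned back, whereas `Module.cpu()` moves the module's parameters and returns the module itself. The sketch below shows the pattern on a stand-in model. The `nn.Linear` layer and the random batch are assumptions for illustration, not the repository's FCN or data.

```python
import torch
import torch.nn as nn

# Stand-ins for self.model and a batch x (hypothetical, for illustration only).
model = nn.Linear(4, 2)
x = torch.randn(3, 4)

device = torch.device("cpu")
model = model.to(device)  # equivalent to model.cpu()
x = x.to(device)          # equivalent to x = x.cpu(); the assignment matters

out = model(x)
print(out.shape, out.device)
```

Both the model and its input need to be on the same device; otherwise the forward pass raises a device mismatch error.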

Bruno Sette · Answer 10 · Wed Jan 05 2022 10:17:23 GMT+0800 (China Standard Time)

I ran it on the CPU and it worked. I believe the problem is in my CUDA and cuDNN configuration.

Minz Won · Answer 11 · Wed Jan 05 2022 16:11:08 GMT+0800 (China Standard Time)

Yes, then you need to check your CUDA configuration.
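
As a starting point for that check, a short report of what PyTorch itself sees can help. The sketch below is an assumption about what is useful to inspect, not a procedure from the repository; `cuda_report` is a hypothetical helper name.

```python
def cuda_report():
    """Collect the CUDA-related facts PyTorch reports about this environment."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # PyTorch is not installed in this environment

    report = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


print(cuda_report())
```

Comparing the CUDA version PyTorch was built with against the installed driver (for example, from `nvidia-smi`) is one common way to spot a mismatch.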