Are my val_loss values valid?

Question

steven8274 opened this issue a year ago

Hi Nils, thanks for your great work on Deep Noise Suppression. I ran into a training problem that confuses me.
I followed the training steps in 'README.md' to train the DTLN model, but the val_loss values I get after each epoch are always positive, around 45. All the val_loss values people have reported here are negative, around -16. Is something wrong with my setup? I set the training and validation set paths as:

path_to_train_mix = '/home/xxx/DNS-Challenge/training_set/train/noisy'
path_to_train_speech = '/home/xxx/DNS-Challenge/training_set/train/clean'
path_to_val_mix = '/home/xxx/DNS-Challenge/training_set/val/noisy'
path_to_val_speech = '/home/xxx/DNS-Challenge/training_set/val/clean'

My training logs are:

None
Epoch 1/200
2023-04-12 15:02:54.045877: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-04-12 15:03:46.368859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-04-12 15:05:01.907399: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-04-12 15:10:02.601697: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
3000/3000 [==============================] - ETA: 0s - loss: 0.0015       
Epoch 00001: val_loss improved from inf to 45.35575, saving model to ./models_DTLN_model/DTLN_model.h5
3000/3000 [==============================] - 1332s 444ms/step - loss: 0.0015 - val_loss: 45.3558 - lr: 0.0010
Epoch 2/200
3000/3000 [==============================] - ETA: 0s - loss: 0.0049   
Epoch 00002: val_loss did not improve from 45.35575
3000/3000 [==============================] - 1333s 444ms/step - loss: 0.0049 - val_loss: 45.4326 - lr: 0.0010
Epoch 3/200
3000/3000 [==============================] - ETA: 0s - loss: 0.0143   
Epoch 00003: val_loss did not improve from 45.35575
3000/3000 [==============================] - 1336s 445ms/step - loss: 0.0143 - val_loss: 45.4434 - lr: 0.0010
Epoch 4/200
3000/3000 [==============================] - ETA: 0s - loss: 0.0434   
Epoch 00004: val_loss improved from 45.35575 to 42.06635, saving model to ./models_DTLN_model/DTLN_model.h5
3000/3000 [==============================] - 1329s 443ms/step - loss: 0.0434 - val_loss: 42.0663 - lr: 0.0010
Epoch 5/200
3000/3000 [==============================] - ETA: 0s - loss: 0.0482   
Epoch 00005: val_loss did not improve from 42.06635
3000/3000 [==============================] - 1332s 444ms/step - loss: 0.0482 - val_loss: 43.5876 - lr: 0.0010
Epoch 6/200
3000/3000 [==============================] - ETA: 0s - loss: 0.0778   
Epoch 00006: val_loss did not improve from 42.06635
3000/3000 [==============================] - 1329s 443ms/step - loss: 0.0778 - val_loss: 46.5396 - lr: 0.0010
Epoch 7/200
3000/3000 [==============================] - ETA: 0s - loss: 0.0847   
Epoch 00007: val_loss did not improve from 42.06635
3000/3000 [==============================] - 1328s 443ms/step - loss: 0.0847 - val_loss: 46.8308 - lr: 0.0010
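
For reading the numbers in this log, here is a minimal sketch, assuming the training loss is the negative SNR in dB (an assumption, though it matches the values around -16 that others report). The function name `negative_snr_db` and the toy signals are made up for illustration. Under that assumption, a negative loss means the estimate is closer to the clean speech than silence would be, while a value near +45 means the error power is tens of thousands of times the speech power, so the output is essentially garbage.

```python
import math


def negative_snr_db(clean, estimate, eps=1e-7):
    """Negative SNR in dB between a clean signal and its estimate.

    Hypothetical helper for illustration; eps only guards against
    division by zero when the estimate is perfect.
    """
    n = len(clean)
    signal_power = sum(s * s for s in clean) / n
    error_power = sum((s - e) ** 2 for s, e in zip(clean, estimate)) / n
    return -10.0 * math.log10(signal_power / (error_power + eps))


# Toy "clean speech": a sine wave.
clean = [math.sin(0.1 * n) for n in range(1000)]

# Estimate with a 10% amplitude error: SNR is about 20 dB, loss about -20.
good_estimate = [0.9 * s for s in clean]
loss_good = negative_snr_db(clean, good_estimate)

# Estimate that is 100x too loud: error dominates, loss is about +39.9.
bad_estimate = [100.0 * s for s in clean]
loss_bad = negative_snr_db(clean, bad_estimate)

print(f"good estimate loss: {loss_good:.2f} dB")
print(f"bad estimate loss:  {loss_bad:.2f} dB")
```

A loss that stays large and positive across epochs therefore points at broken computation rather than slow convergence.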

steven · Answer 1 · Wed Apr 12 2023 18:42:26 GMT+0800 (China Standard Time)

2023-04-12 15:10:02.601697: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.

Does this error invalidate the training run?

steven · Answer 2 · Wed Apr 12 2023 22:06:45 GMT+0800 (China Standard Time)

That was the reason!
I was using an RTX 3060 Ti GPU, which is not compatible with CUDA 10.1. After switching to CUDA 11.2 and TensorFlow 2.5.0, the val_loss is now negative.
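
For anyone who hits the same `sm_86` message, a rough sketch of how to check for this mismatch before training. The table and the helper names (`MIN_CUDA_FOR_COMPUTE_CAPABILITY`, `toolkit_supports_gpu`, `report_tensorflow_gpus`) are made up for illustration, and the minimum-version table is only a partial list to the best of my knowledge, so verify it against NVIDIA's release notes. The TensorFlow calls are assumed to be available, which should hold on reasonably recent 2.x releases but may not on older ones.

```python
# Minimum CUDA toolkit (major, minor) able to target each compute capability.
# Partial, illustrative table; check NVIDIA's documentation for your GPU.
MIN_CUDA_FOR_COMPUTE_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series, e.g. sm_86)
}


def toolkit_supports_gpu(cuda_version, compute_capability):
    """Return True/False, or None if the compute capability is not in the table."""
    required = MIN_CUDA_FOR_COMPUTE_CAPABILITY.get(tuple(compute_capability))
    if required is None:
        return None
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= required


def report_tensorflow_gpus():
    """Print the CUDA version TensorFlow was built with and each GPU's capability."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    cuda_version = tf.sysconfig.get_build_info().get("cuda_version")
    print("TensorFlow", tf.__version__, "built against CUDA", cuda_version)
    for gpu in tf.config.list_physical_devices("GPU"):
        details = tf.config.experimental.get_device_details(gpu)
        capability = details.get("compute_capability")
        print(details.get("device_name"), "compute capability", capability)
        if cuda_version and capability:
            try:
                print("  supported:", toolkit_supports_gpu(cuda_version, capability))
            except ValueError:
                print("  could not parse CUDA version string:", cuda_version)


if __name__ == "__main__":
    try:
        report_tensorflow_gpus()
    except ImportError:
        print("TensorFlow is not installed in this environment.")
```

With the setup described in this thread, `toolkit_supports_gpu("10.1", (8, 6))` is `False`, while `toolkit_supports_gpu("11.2", (8, 6))` is `True`.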