Silence / Background Noise similarity
Tomas1337 opened this issue · comments
I've been having fun playing with your pre-trained model and implementation!
I've noticed a phenomenon that could be a point of improvement. When you record silence or background noise and extract the features from it, say silent_features, the result has a strong cosine_similarity to everything. I was wondering: if you trained the model with various background noises / silence in the train_set and labelled them all silent_features, would it learn to predict the various silent_features and distinguish them from voices?
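The effect described above is easy to check numerically. Here is a minimal sketch (the embedding vectors and their dimensionality are illustrative stand-ins, not output from the actual model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-dim embeddings; in practice you would compare a
# "silence" embedding against several speaker embeddings and observe
# the uniformly high scores described in this issue.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=512)
speaker_b = rng.normal(size=512)
print(cosine_similarity(speaker_a, speaker_b))
```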
Happy to hear that!
So from what I can tell, the model was trained on clean speech without silence or background noise. Technically, the model has only ever heard clear voices. To draw a parallel with a simple cat/dog classifier, it would be like showing the model a car: it would still predict either a cat or a dog.
if you train the model and include various background noises / silence on the train_set and label them all silent_features, it would learn to predict various silent_features and distinguish it from voices.
Yes, it's true. I'm sure the model is smart enough to learn this too.
Hello!
I've taken the repo/dataset and combined it with the VoxCeleb2 dataset (6,112 speakers). I also added a 'speaker' composed of a bunch of noise/silence samples. After I processed the VoxCeleb data into the same format (FLAC, 16 kHz, 24-bit samples) as the LibriSpeech data, I made another pass over both datasets and, for every utterance, created 2 new training examples combined with random noise selected from https://github.com/microsoft/MS-SNSD . That resulted in around 730 GB of training data. I've added 1k speakers to the initial classifier/softmax training and am currently running that training. Once it's complete I'll do the triplet loss training and share the code/weights. I'm running it on a 2080 Ti with 64 GB of RAM, and I needed a bit over 200 GB of swap space to keep the OOM killer at bay. An epoch currently takes slightly over 1 hour.
Talk to you in a week or two :)
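For anyone following along, the noise-augmentation step can be sketched roughly like this (a simplified version of what the MS-SNSD scripts do; the function name and SNR range here are illustrative, not taken from the actual preparation code):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise
    ratio, then add it to `clean`. Both are float waveforms of equal length."""
    clean_rms = np.sqrt(np.mean(np.square(clean)))
    noise_rms = np.sqrt(np.mean(np.square(noise)))
    target_noise_rms = clean_rms / (10.0 ** (snr_db / 20.0))
    return clean + noise * (target_noise_rms / noise_rms)

# Two augmented copies per utterance, each at a random SNR:
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100, 16000))  # stand-in for a real utterance
noise = rng.normal(size=16000)              # stand-in for an MS-SNSD clip
augmented = [mix_at_snr(clean, noise, rng.uniform(0, 20)) for _ in range(2)]
```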
@w1nk AWESOME! Please let us know how it goes :)
Just an update:
I ended up needing to switch TensorFlow versions (to 2.3); 2.2 has a nasty memory leak that was getting triggered. Once I got things running stably, the softmax network converged and I early-stopped it at epoch 38, then started training the triplet loss. That network is still training, but is getting close:
2000/2000 [==============================] - 815s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0230 - val_loss: 0.0221
Epoch 336/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0229 - val_loss: 0.0228
Epoch 337/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0227 - val_loss: 0.0227
Epoch 338/1000
2000/2000 [==============================] - 813s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0226 - val_loss: 0.0219
Epoch 339/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0227 - val_loss: 0.0218
Epoch 340/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0221 - val_loss: 0.0219
Epoch 341/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0223 - val_loss: 0.0216
Epoch 342/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0222 - val_loss: 0.0215
Looks like it's fitting nicely, and spot-checking some of the later epochs looks pretty good as well. I'll find somewhere to put the checkpoints and a couple of the preparation scripts.
@w1nk very cool!
How are you splitting the train/val/test datasets? I found in the code that train/val/test come from the same speakers. Have you tried splitting the dataset by speaker? I'm also curious about your results.
Hey @ntdat017, I haven't modified the training harness at all so the validation split is being calculated how it's written. For test, I've got a holdout set of data from the voxceleb dataset that I'll use to perform the evaluation.
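For anyone who does want the speaker-disjoint split being asked about, a minimal sketch looks like this (the function and data layout are illustrative, not taken from the repo's training harness):

```python
import random

def speaker_disjoint_split(items, val_fraction=0.1, seed=0):
    """Split (speaker_id, utterance_path) pairs so that no speaker
    appears in both the train and validation sets."""
    speakers = sorted({spk for spk, _ in items})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_val = max(1, int(len(speakers) * val_fraction))
    val_speakers = set(speakers[:n_val])
    train = [it for it in items if it[0] not in val_speakers]
    val = [it for it in items if it[0] in val_speakers]
    return train, val

# Toy corpus: 100 speakers with 3 utterances each.
items = [(f"spk{i:04d}", f"spk{i:04d}/utt{j}.flac")
         for i in range(100) for j in range(3)]
train, val = speaker_disjoint_split(items, val_fraction=0.1)
```

Splitting by speaker gives a more honest validation signal for speaker verification, since the model can't exploit having already seen a speaker's voice during training.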
Sorry for the delay, it's been a busy week. The triplet training finally converged after a bit over 600 epochs. I haven't had a chance to fully evaluate the output yet, but I've gone ahead and uploaded the checkpoints and some helper scripts I used in case anyone reading along is interested.
https://drive.google.com/drive/folders/1EExljgrj3kP-ciUzrsdoWYE5OT14_7Aa
sha256 hashes:
b71ca16f8364605a8234c9458f8b2b5ae8c2e0f7ca1d551de4d332acdb40ab90 ResCNN_softmax_checkpoint_38.h5
d86a3ac61a427bbc6f425e3b561dd9ed28f57b789f0eb4bf04d3434113f86dab ResCNN_triplet_checkpoint_613.h5
There are 3 files there: the 2 checkpoints (softmax + triplet) and a tar file with some helper scripts. The helper Python scripts probably don't run out of the box, but they're pretty simple and should be easy to fix up.
process_vox.py - generates a file that can be split/executed as bash commands to convert the vox speech files into the correct naming scheme and proper encoding (requires ffmpeg with FLAC support).
create_noise.py - uses random noise samples from https://github.com/microsoft/MS-SNSD to generate 'noisy' versions of each input audio clip.
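As a rough idea of what the conversion step looks like, here is an illustrative command builder (not the actual contents of process_vox.py; the ffmpeg flag choices are assumptions that should be checked against your ffmpeg build):

```python
def convert_to_flac_cmd(src_path, dst_path, sample_rate=16000):
    """Build an ffmpeg command converting a VoxCeleb audio file to
    mono 16 kHz FLAC. Flag choices here are illustrative."""
    return [
        "ffmpeg", "-y",
        "-i", src_path,
        "-ar", str(sample_rate),   # resample to 16 kHz
        "-ac", "1",                # downmix to mono
        "-sample_fmt", "s32",      # high bit depth for the FLAC encoder
        dst_path,
    ]

# Print commands instead of running them, mirroring the
# "generate a file of bash commands" approach described above:
cmd = convert_to_flac_cmd("id00017/utt001.m4a", "id00017/utt001.flac")
print(" ".join(cmd))
```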
@w1nk that's really awesome!!!! I'm going to have a look this weekend.
I got an error when loading this model:
model = keras.models.load_model('ResCNN_triplet_checkpoint_613.h5', compile=False)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\save.py", line 182, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 178, in load_model_from_hdf5
custom_objects=custom_objects)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\model_config.py", line 55, in model_from_config
return deserialize(config, custom_objects=custom_objects)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\serialization.py", line 175, in deserialize
printable_module_name='layer')
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 358, in deserialize_keras_object
list(custom_objects.items())))
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 617, in from_config
config, custom_objects)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 1204, in reconstruct_from_config
process_layer(layer_data)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 1186, in process_layer
layer = deserialize_layer(layer_data, custom_objects=custom_objects)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\serialization.py", line 175, in deserialize
printable_module_name='layer')
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 358, in deserialize_keras_object
list(custom_objects.items())))
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1006, in from_config
config, custom_objects, 'function', 'module', 'function_type')
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1058, in _parse_function_from_config
config[func_attr_name], globs=globs)
File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 457, in func_load
code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)
What is your version of Keras, Tensorflow, and Python?
@demonstan the ones specified in the requirements.txt of the repo.
@demonstan I've not had a chance to perform the evaluation fully yet. Since I trained on all of librispeech and all the voxceleb2 training data, I need to take the voxceleb2 test data set and convert/rename it to the correct format and evaluate on that. I've not had a chance to do that yet.
As for loading, it should load with TF 2.1/2/3 (I tried all of them) along with 1.15 as well. I was loading the model across those versions trying to get the tflite/coral compilation to work (hint: I didn't yet due to a coral compiler issue).
May I ask why the EarlyStopping and ReduceLROnPlateau callbacks aren't used here?
Lines 40 to 42 in 7742796
@demonstan they could be used indeed. It's just that I always saw the loss decreasing steadily and I didn't think it was a necessity. Overfitting on this dataset would have been a pretty big challenge. The loss looked like an exponentially decreasing function on both the training and testing sets.
It may also be helpful to use SoX to remove silence and background noise. That's what I usually do: denoise and split by silence, then compute embeddings.
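For reference, a sketch of that SoX step driven from Python (the silence-effect thresholds are illustrative and usually need tuning per recording setup):

```python
import subprocess

def sox_strip_silence_cmd(src_path, dst_path, threshold="1%", min_dur="0.1"):
    """Build a SoX command that trims leading silence and strips
    silent stretches from the rest of the file."""
    return [
        "sox", src_path, dst_path,
        "silence", "1", min_dur, threshold,   # trim silence at the start
        "-1", min_dur, threshold,             # remove silence elsewhere
    ]

cmd = sox_strip_silence_cmd("utt.flac", "utt_trimmed.flac")
# subprocess.run(cmd, check=True)  # requires SoX to be installed
```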
Good point.
Linked to the README for reference.