philipperemy / deep-speaker

Deep Speaker: an End-to-End Neural Speaker Embedding System.

Silence / Background Noise similarity

Tomas1337 opened this issue · comments

I've been having fun playing with your pre-trained model and implementation!

I've noticed a phenomenon that could be a point of improvement. When you record silence or background noise and extract features from it, say silent_features, the result has a strong cosine_similarity to almost anything. I was wondering: if you trained the model with various background noises / silence in the train_set, all labeled as silent_features, would it learn to predict silent_features and distinguish them from voices?
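For reference, here's roughly how I compute the embeddings and similarity (a minimal sketch following the README-style usage; the recordings and the checkpoint filename are placeholders):

import numpy as np

from audio import read_mfcc
from batcher import sample_from_mfcc
from constants import SAMPLE_RATE, NUM_FRAMES
from conv_models import DeepSpeakerModel
from test import batch_cosine_similarity

model = DeepSpeakerModel()
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

def embed(path):
    # MFCC extraction + fixed-length sampling, as in the repo's README example.
    mfcc = sample_from_mfcc(read_mfcc(path, SAMPLE_RATE), NUM_FRAMES)
    return model.m.predict(np.expand_dims(mfcc, axis=0))

silence = embed('silence_or_background_noise.wav')   # placeholder recording
voice = embed('some_speaker_utterance.wav')          # placeholder recording
print(batch_cosine_similarity(silence, voice))       # comes out surprisingly high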

Happy to hear that!

So from what I can tell, the model was trained on clean speech, without silence or background noise. Technically, the model has only ever heard clear voices so far. To draw a parallel with a simple cat/dog classifier, it would be like showing the model a car: it would still predict either a cat or a dog.

if you trained the model with various background noises / silence in the train_set, all labeled as silent_features, would it learn to predict silent_features and distinguish them from voices?

Yes, that's true. I'm sure the model is smart enough to learn this too.

Hello!

I've taken the repo/dataset and combined it with the VoxCeleb2 dataset (6112 speakers). I also added a 'speaker' composed of a bunch of noise/silence samples. After processing the VoxCeleb2 data into the same format as the LibriSpeech data (FLAC, 16 kHz, 24-bit samples), I made another pass over both datasets and, for every utterance, created 2 new training examples combined with random noise selected from https://github.com/microsoft/MS-SNSD . That resulted in around 730 GB of training data. I've added 1k speakers to the initial classifier/softmax training and am currently running that training. Once it's complete, I'll run the triplet-loss training and share the code/weights. I'm running it on a 2080 Ti with 64 GB of RAM, and I needed a bit over 200 GB of swap space to keep the OOM killer at bay. An epoch currently takes slightly over 1 hour.
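Roughly, the noise-mixing step looks like this (a simplified sketch; mix_at_snr, the file paths, and the SNR range are illustrative placeholders, not the actual script):

import numpy as np
import librosa
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    # Tile/trim the noise to the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so the mixture lands at the requested SNR.
    clean_rms = np.sqrt(np.mean(clean ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-8
    target_noise_rms = clean_rms / (10 ** (snr_db / 20))
    return clean + noise * (target_noise_rms / noise_rms)

clean, _ = librosa.load('utterance.flac', sr=16000)        # placeholder paths
noise, _ = librosa.load('ms_snsd_noise.wav', sr=16000)
noisy = mix_at_snr(clean, noise, snr_db=np.random.uniform(0, 20))
sf.write('utterance_noisy.flac', noisy, 16000)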

Talk to you in a week or two :)

@w1nk AWESOME! Please let us know how it goes :)

Just an update:

I ended up needing to switch TensorFlow versions (to 2.3); 2.2 has a nasty memory leak that was getting triggered. Once I got things running stably, the softmax network converged and I early-stopped it at epoch 38, then started training the triplet loss. That network is still training, but it's getting close:

2000/2000 [==============================] - 815s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0230 - val_loss: 0.0221
Epoch 336/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0229 - val_loss: 0.0228
Epoch 337/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0227 - val_loss: 0.0227
Epoch 338/1000
2000/2000 [==============================] - 813s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0226 - val_loss: 0.0219
Epoch 339/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0227 - val_loss: 0.0218
Epoch 340/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0221 - val_loss: 0.0219
Epoch 341/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0223 - val_loss: 0.0216
Epoch 342/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0222 - val_loss: 0.0215

It looks like it's fitting nicely, and spot-checks of some of the later epochs look pretty good as well. I'll find somewhere to put the checkpoints and a couple of the preparation scripts.
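For reference, the objective being optimized here is the paper's cosine-based triplet loss; a minimal TensorFlow sketch (the margin value is an assumption, not pulled from the repo):

import tensorflow as tf

def cosine_triplet_loss(anchor, positive, negative, alpha=0.1):
    # Embeddings are assumed L2-normalized, so dot products are cosine similarities.
    sim_ap = tf.reduce_sum(anchor * positive, axis=-1)   # anchor vs same speaker
    sim_an = tf.reduce_sum(anchor * negative, axis=-1)   # anchor vs different speaker
    # Penalize whenever the negative is not at least `alpha` less similar than the positive.
    return tf.reduce_mean(tf.maximum(sim_an - sim_ap + alpha, 0.0))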

@w1nk very cool!

How are you splitting the train/val/test datasets? I found in the code that train/val/test come from the same speakers. Have you tried splitting the dataset by speaker, so each set contains different speakers? I'm also curious about your results.

Hey @ntdat017, I haven't modified the training harness at all, so the validation split is computed exactly as it's written in the code. For test, I've got a holdout set from the VoxCeleb data that I'll use to perform the evaluation.
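For what it's worth, a speaker-disjoint split would hold out whole speakers rather than utterances; a minimal sketch (utterances_by_speaker and the holdout counts are illustrative assumptions, not code from this repo):

import random

# Hypothetical mapping built during data prep: speaker_id -> list of utterance paths.
utterances_by_speaker = {f'spk{i:04d}': [] for i in range(6112)}  # placeholder

speakers = sorted(utterances_by_speaker)
random.Random(0).shuffle(speakers)            # deterministic shuffle
n_val, n_test = 300, 300                      # illustrative holdout sizes
val_speakers = set(speakers[:n_val])
test_speakers = set(speakers[n_val:n_val + n_test])
train_speakers = set(speakers[n_val + n_test:])
# No speaker appears in more than one set, so val/test measure generalization to unseen voices.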

Sorry for the delay, it's been a busy week. The triplet training finally converged after a bit over 600 epochs. I haven't had a chance to fully evaluate the output yet, but I've gone ahead and uploaded the checkpoints and some helper scripts I used in case anyone reading along is interested.

https://drive.google.com/drive/folders/1EExljgrj3kP-ciUzrsdoWYE5OT14_7Aa

sha256 hashes:
b71ca16f8364605a8234c9458f8b2b5ae8c2e0f7ca1d551de4d332acdb40ab90 ResCNN_softmax_checkpoint_38.h5
d86a3ac61a427bbc6f425e3b561dd9ed28f57b789f0eb4bf04d3434113f86dab ResCNN_triplet_checkpoint_613.h5

There are 3 files there: the 2 checkpoints (softmax + triplet) and a tar file with some helper scripts. The helper Python scripts probably don't run out of the box, but they're pretty simple and should be easy to fix up.

process_vox.py - generates a file that can be split up and executed as bash commands to convert the VoxCeleb speech files into the correct naming scheme and proper encoding (requires ffmpeg with FLAC support).

create_noise.py - uses random noise samples from https://github.com/microsoft/MS-SNSD to generate 'noisy' versions of each input audio clip.
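To give a flavor of what process_vox.py produces, here's a hypothetical sketch that emits one ffmpeg command per VoxCeleb2 clip; the directory layout and naming scheme below are assumptions, not the script's actual conventions:

import pathlib

vox_root = pathlib.Path('voxceleb2/dev/aac')   # assumed layout: speaker_id/video_id/clip.m4a
out_root = pathlib.Path('deep-speaker-data')

with open('convert_commands.sh', 'w') as f:
    for m4a in vox_root.rglob('*.m4a'):
        speaker, video, clip = m4a.parts[-3], m4a.parts[-2], m4a.stem
        out = out_root / speaker / f'{speaker}_{video}-{clip}.flac'
        # One shell command per file: 16 kHz mono FLAC, LibriSpeech-style naming.
        f.write(f'mkdir -p {out.parent} && ffmpeg -y -i {m4a} -ar 16000 -ac 1 {out}\n')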

@w1nk that's really awesome!!!! I'm going to have a look this weekend.

I got an error when loading this model.

model = keras.models.load_model('ResCNN_triplet_checkpoint_613.h5', compile=False)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\save.py", line 182, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 178, in load_model_from_hdf5
    custom_objects=custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\model_config.py", line 55, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\serialization.py", line 175, in deserialize
    printable_module_name='layer')
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 358, in deserialize_keras_object
    list(custom_objects.items())))
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 617, in from_config
    config, custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 1204, in reconstruct_from_config
    process_layer(layer_data)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 1186, in process_layer
    layer = deserialize_layer(layer_data, custom_objects=custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\serialization.py", line 175, in deserialize
    printable_module_name='layer')
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 358, in deserialize_keras_object
    list(custom_objects.items())))
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1006, in from_config
    config, custom_objects, 'function', 'module', 'function_type')
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1058, in _parse_function_from_config
    config[func_attr_name], globs=globs)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 457, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)

What are your versions of Keras, TensorFlow, and Python?

@demonstan the ones specified in the requirements.txt of the repo.

@w1nk Did you perform evaluation on any dataset?

@demonstan I've not had a chance to perform the evaluation fully yet. Since I trained on all of LibriSpeech and all the VoxCeleb2 training data, I need to take the VoxCeleb2 test set, convert/rename it to the correct format, and evaluate on that; I haven't gotten to it yet.

As for loading, it should load with TF 2.1/2.2/2.3 (I tried all of them), and with 1.15 as well. I was loading the model across those versions while trying to get the TFLite/Coral compilation to work (hint: I haven't managed it yet, due to a Coral compiler issue).
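For anyone still hitting the "bad marshal data" error above: it typically means the saved model's Lambda layers were serialized under a different Python version than the one loading them. A common workaround (a sketch mirroring the repo's README usage, not a confirmed fix) is to rebuild the architecture from the repo's code and load only the weights:

from conv_models import DeepSpeakerModel

# Rebuild the ResCNN architecture from the repo, then load weights only,
# avoiding deserialization of the pickled/marshaled Lambda layers.
model = DeepSpeakerModel()
model.m.load_weights('ResCNN_triplet_checkpoint_613.h5', by_name=True)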

May I ask why you're not using the EarlyStopping and ReduceLROnPlateau callbacks here?

deep-speaker/train.py

Lines 40 to 42 in 7742796

dsm.m.fit(x=train_generator(), y=None, steps_per_epoch=2000, shuffle=False,
          epochs=1000, validation_data=test_generator(), validation_steps=len(test_batches),
          callbacks=[checkpoint])

@demonstan they could indeed be used. It's just that the loss always decreased steadily, so I didn't think they were a necessity. Overfitting on this dataset would have been a pretty big challenge; the loss looked like an exponentially decreasing function on both the training and test sets.
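For anyone who does want them, a sketch of wiring the two callbacks into the existing fit() call; the patience and factor values are illustrative assumptions, not values from the repo:

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stop once val_loss plateaus, and halve the learning rate on shorter plateaus.
early_stop = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)

dsm.m.fit(x=train_generator(), y=None, steps_per_epoch=2000, shuffle=False,
          epochs=1000, validation_data=test_generator(), validation_steps=len(test_batches),
          callbacks=[checkpoint, early_stop, reduce_lr])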

When you record silence or background noise and extract features from it, the result has a strong cosine_similarity to almost anything.

It may also be helpful to use SoX to remove silence and background noise. That's what I usually do: denoise, split on silence, and then compute the embeddings.
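For reference, a minimal sketch of the kind of SoX-based splitting meant here, called from Python; the silence thresholds are illustrative and usually need tuning per recording setup:

import subprocess

# Split input.wav wherever >= 0.5 s falls below 1% amplitude; each segment is
# written out as a numbered file (chunk_001.wav, chunk_002.wav, ...).
subprocess.run(['sox', 'input.wav', 'chunk_.wav',
                'silence', '1', '0.1', '1%', '1', '0.5', '1%',
                ':', 'newfile', ':', 'restart'], check=True)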

Good point.

Linked to the README for reference.