baidu-research / ba-dls-deepspeech


Suggestions for improving dev-set performance.

Feynman27 opened this issue

(I apologize if this question is better suited for StackOverflow, but I figure posting it here will reach the right audience in a shorter amount of time.)

I'm training this CTC-cost model on the LibriSpeech "train-other-500" dataset, which contains 500 hours of speech audio and transcripts. I'm using the "dev-other" dataset for validation; it's apparently a more challenging audio set to model.

I trained the model for 20 epochs; the distribution of CTC costs is plotted below.

[figure: CTC costs over 20 epochs of training]

The weights are updated according to Nesterov momentum.
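For reference, a Nesterov step evaluates the gradient at a look-ahead point before updating the velocity and weights. A minimal NumPy sketch (the names `nesterov_step`, `grad_fn`, `lr`, and `momentum` are illustrative, not this repo's API):

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=1e-4, momentum=0.9):
    """One Nesterov-momentum update: take the gradient at the
    look-ahead point w + momentum * v, then step velocity and weights."""
    g = grad_fn(w + momentum * v)  # gradient at the look-ahead point
    v = momentum * v - lr * g      # update the velocity
    return w + v, v                # step the weights

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = nesterov_step(w, v, lambda x: 2 * x, lr=0.1)
```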

Since the validation performance plateaus at around iter=25000, I checkpointed the model there and resumed training with an exponential learning-rate decay schedule, decreasing the learning rate after each epoch (starting from iter=25000). The CTC costs under this schedule are shown below after a few epochs, with a sketch of the schedule after the plot:

[figure: CTC costs after resuming with the exponential learning-rate decay schedule]
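Concretely, the schedule I'm applying is a per-epoch exponential decay, roughly like this (the starting rate and `decay_rate` here are placeholder values, not the ones from my run):

```python
def decayed_lr(initial_lr, decay_rate, epoch):
    # lr_t = lr_0 * decay_rate ** epoch, applied once per epoch
    return initial_lr * decay_rate ** epoch

# e.g. decaying from a checkpointed run over the next few epochs
for epoch in range(5):
    print(f"epoch {epoch}: lr = {decayed_lr(1e-4, 0.95, epoch):.2e}")
```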

Unfortunately, this strategy doesn't appear to improve the model performance. Does anyone have any suggestions on how to improve the model other than what I've described above?

From the looks of it, your model has high variance. You could try reducing the initial learning rate, adding regularization (dropout, or augmenting the training audio with noise), or, if those don't help, changing the model architecture. A sketch of the noise-augmentation idea is below.
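One simple form of noise augmentation is mixing white Gaussian noise into the waveform at a target SNR. A minimal sketch (`snr_db` is an illustrative parameter; real pipelines often mix in recorded background noise instead):

```python
import numpy as np

def add_noise(audio, snr_db=20.0):
    """Mix white Gaussian noise into a waveform at a target SNR (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# Toy usage on one second of fake 16 kHz audio
noisy = add_noise(np.random.randn(16000), snr_db=20.0)
```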

You can also try training on more data. By default the max wav length is set to 10 seconds (https://github.com/baidu-research/ba-dls-deepspeech/blob/master/data_generator.py#L53-L54), which excludes a good portion of the LibriSpeech corpus. Longer utterances will most likely require more memory, though.
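The filtering itself is just a duration check, so raising the cap is a matter of loosening that threshold. A hypothetical sketch (names and data are illustrative; see the linked lines in data_generator.py for the actual implementation):

```python
MAX_DURATION = 20.0  # seconds; the repo's default cap is 10.0

# (path, duration in seconds, transcript) tuples -- illustrative data
utterances = [
    ("LibriSpeech/a.flac", 8.2, "..."),
    ("LibriSpeech/b.flac", 14.7, "..."),
]

kept = [u for u in utterances if u[1] <= MAX_DURATION]
# With the default 10 s cap only the first clip survives; at 20 s both
# are kept, at the price of more memory per batch.
```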