githubharald / CTCWordBeamSearch

Connectionist Temporal Classification (CTC) decoder with dictionary and language model.

Home Page: https://towardsdatascience.com/b051d28f3d2e

Question about decoder output.

biscayan opened this issue · comments

If you create a new issue, please provide the following information:

  1. Which program causes the problem
     • NumPy operation (Python package)
  2. Versions
     • PyTorch version 1.6.0
     • Python version 3.7.7
     • Operating system: Ubuntu 18.04
  3. Issue
     Hi, I'm trying to adapt your decoder to a speech recognition project.
     After doing some experiments, I have a question about the decoder output.

My input data (speech -> spectrogram) is fed into the model (CNN+RNN), and the model produces an output with shape [sequence length (T) x batch size (B) x number of characters (C)], e.g. (371, 32, 29).
This output is then fed into the function "def testPyBind(feedMat, corpus, chars, wordChars)".
My corpus.txt is composed of the transcriptions of the validation and test sets. It has 146,709 sentences.
My chars.txt has 28 characters: ' ABCDEFGHIJKLMNOPQRSTUVWXYZ
My wordChars.txt has 27 characters: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ (the space from chars.txt is removed)
The decoder then produces a list of sentences of batch length, e.g. 32 sentences.
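For context, here is a minimal sketch of such a call, assuming the model returns raw logits, that the decoder expects softmax probabilities per time step, and that `testPyBind` is the helper from the repository's test script (it is not imported here); the variable names are illustrative:

```python
import torch
import numpy as np

# stand-in for the CNN+RNN output; shape T x B x C, e.g. (371, 32, 29)
rnn_out = torch.randn(371, 32, 29)

# word beam search works on per-time-step character probabilities,
# so a softmax over the character dimension is applied first
feed_mat = torch.softmax(rnn_out, dim=2).detach().cpu().numpy().astype(np.float32)

# dictionary and language-model inputs, as described above
chars = open('chars.txt').read()          # 28 decodable characters; the CTC blank is the extra 29th index
word_chars = open('wordChars.txt').read()
corpus = open('corpus.txt').read()

# decode all B batch elements with the helper from the repo's test script;
# the result is a list of B strings
decoded = testPyBind(feed_mat, corpus, chars, word_chars)
```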

My first question: the output sentences from the decoder are 100% identical to sentences in the corpus.
I expected some character errors in the output sentences, since they are predictions made by the model and the decoder.
I don't understand why the decoder only produces sentences that appear in the corpus.

My second question is how I can align the predicted sentences (decoder output) with the target sentences to calculate WER and CER.
I calculated CER and WER, but these sentences totally mismatch.
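For reference, a minimal sketch of one way to compute per-sample CER and WER in plain Python; `decoded` and `targets` are assumed to be lists of strings kept in the same batch order:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists), standard DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(pred, target):
    """Character error rate: character edit distance / target length."""
    return levenshtein(pred, target) / max(len(target), 1)

def wer(pred, target):
    """Word error rate: word-level edit distance / number of target words."""
    return levenshtein(pred.split(), target.split()) / max(len(target.split()), 1)

# decoded[i] must correspond to targets[i] for the same batch element,
# i.e. the batch order of the decoder output and the ground truth must match:
# avg_cer = sum(cer(p, t) for p, t in zip(decoded, targets)) / len(targets)
```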
Here is my result:

Epoch: 0/100 | Train loss: 6.999162
Epoch: 0/100 | Val loss: 2.889192
Average CER: 767.66% | Average WER: 1543.43%

Epoch: 5/100 | Train loss: 4.038735
Epoch: 5/100 | Val loss: 1.960615
Average CER: 734.83% | Average WER: 1154.09%

Epoch: 10/100 | Train loss: 3.314095
Epoch: 10/100 | Val loss: 1.570683
Average CER: 747.28% | Average WER: 1155.41%
....

When I use the greedy search algorithm, I get a result like this:

Epoch: 0/100 | Train loss: 2.138510
Epoch: 0/100 | Val loss: 2.078584
Average CER: 65.45% | Average WER: 99.07%

Epoch: 5/100 | Train loss: 1.150841
Epoch: 5/100 | Val loss: 1.037751
Average CER: 32.14% | Average WER: 79.66%

Epoch: 10/100 | Train loss: 0.916023
Epoch: 10/100 | Val loss: 0.855249
Average CER: 26.44% | Average WER: 71.63%
...

Please reply to my questions.
Thank you for your consideration.

Please narrow down the problem.
I can only have a look at one or two samples for which you think the output is wrong.
Look at the questions asked and the data provided by the user in #49 to see how this could be done effectively.

Provide all the data I need to reproduce the same results (at least these files: chars.txt, wordChars.txt, corpus.txt, gt_X.txt, mat_X.csv; a saved NumPy array is also fine). It would be best to select one sample where I can easily see what's wrong.
Further, provide exactly the parameters with which you call the method. If you changed any code, share that file.

I changed the RNN output a little bit, but the result is even more weird.

My parameters: batch size (feedMat.shape[2]) = 32, beamwidth = 25, lmtype = 'Words', lmsmoothing = 0.0
  • chars.txt
  • wordChars.txt
  • corpus.txt
  • RNN output (npy format): RNN_output.zip
If you need more files or information, please let me know.
Thank you.

I had a quick look at the RNN output for batch element 0: only the "CTC blank" character occurs as a strong signal; all other characters have a probability of at most 0.3.
Here is a plot for batch element 0 after applying softmax (x-axis = chars, y-axis = time). As you can see, only the blank has high-probability predictions.
[plot: softmax output over time for batch element 0]
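A sketch of how this inspection could be reproduced; the file name `RNN_output.npy` and the assumption that the CTC blank is the last character index are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

mat = np.load('RNN_output.npy')          # shape T x B x C, e.g. (371, 32, 29)
sample = mat[:, 0, :]                    # batch element 0 -> shape T x C

# numerically stable softmax over the character dimension
e = np.exp(sample - sample.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# assuming the CTC blank is the last index: how strong are the non-blank characters?
print('max non-blank probability:', probs[:, :-1].max())

plt.imshow(probs, aspect='auto')         # x-axis = chars, y-axis = time
plt.xlabel('character index')
plt.ylabel('time step')
plt.show()
```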

So I suggest getting the model accuracy to an acceptable level (using only best path decoding), and only then switching to word beam search to integrate language information. The language model only makes sense when the predictions from the neural network are already quite good.
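For completeness, a minimal sketch of best path (greedy) decoding for a single batch element, assuming the blank is the last character index:

```python
import numpy as np

def best_path_decode(mat, chars, blank_idx):
    """Greedy CTC decoding: argmax per time step, collapse repeats, drop blanks.

    mat: T x C matrix of per-time-step character scores for one batch element.
    chars: string of decodable characters, indexed consistently with mat.
    blank_idx: index of the CTC blank in mat's character dimension.
    """
    best = np.argmax(mat, axis=1)
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank_idx:  # skip repeats and blanks
            out.append(chars[idx])
        prev = idx
    return ''.join(out)

# e.g. for batch element 0 of a T x B x C array, with the blank as the last index:
# text = best_path_decode(mat[:, 0, :], chars, blank_idx=mat.shape[2] - 1)
```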

I think that is because I gave you RNN output from an early stage of training. I do get an acceptable level of accuracy when I run the experiment with greedy search for 100 training epochs: CER is 10% and WER is 30%. However, word beam search can't predict the sentences at all.

You would have to provide an example where best path decoding works better than word beam search; otherwise I can't check what's going on.

Please provide:

  • RNN output (you can again just dump the NumPy array)
  • Ground truth (what is the correct decoding)
  • Your output for greedy decoding and word beam search
  • I suppose the other files (chars.txt, corpus.txt, ...) are still the same as the ones you already uploaded?

Closing because of inactivity.