1ytic / warp-rnnt

CUDA-Warp RNN-Transducer

question about the rnnt loss arguments

songtaoshi opened this issue · comments

        log_probs (torch.FloatTensor): Input tensor with shape (N, T, U, V)
            where N is the minibatch size, T is the maximum number of
            input frames, U is the maximum number of output labels and V is
            the vocabulary of labels (including the blank).
        labels (torch.IntTensor): Tensor with shape (N, U-1) representing the
            reference labels for all samples in the minibatch.

Hi, I am confused about the labels. Why should the shape be (N, U-1)?
Shouldn't <eos> be included in the labels?
@1ytic

Also, in the training code I see that the LM input ys is the same as the target ys.
Shouldn't it be "+text" (a symbol prepended) as input and "text+" (a symbol appended) as output?
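For reference, a common convention in RNN-T training code (this is an assumption for illustration, not necessarily what this repo does) is to prepend a start symbol to the prediction-network input, with no shifted output target, because the loss consumes the joint log_probs directly:

```python
# Sketch of a common RNN-T prediction-network input convention
# (an assumption for illustration; check the repo's training code
# for the actual scheme).
def make_pred_input(labels, start=0):
    """Prepend a start symbol so the network emits U = len(labels) + 1
    distributions: one for the empty prefix and one per label prefix."""
    return [start] + list(labels)
```

Unlike a language model, there is no shifted "text+" target here: the U-1 reference labels go straight into the loss, which handles the alignment itself.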

If I remember correctly, U includes the "empty" output, very similar to the first element in the scoring matrix when you align two sequences, for example as in https://en.wikipedia.org/wiki/Smith–Waterman_algorithm
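To make the analogy concrete, here is a minimal single-utterance RNN-T forward recursion over the (T, U) lattice (a plain NumPy sketch for illustration, not this repo's CUDA kernel). Row u = 0 corresponds to the "empty" output, which is why the grid has U = len(labels) + 1 rows:

```python
import numpy as np

def rnnt_forward(log_probs, labels, blank=0):
    """Negative log-likelihood for one utterance.
    log_probs: (T, U, V) with U = len(labels) + 1; row u=0 is the empty prefix."""
    T, U, V = log_probs.shape
    alpha = np.full((T, U), -np.inf)
    alpha[0, 0] = 0.0  # start in the top-left cell, as in an alignment matrix
    for t in range(T):
        for u in range(U):
            if t == 0 and u == 0:
                continue
            val = -np.inf
            if t > 0:  # consume a frame by emitting blank
                val = np.logaddexp(val, alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # emit the next reference label
                val = np.logaddexp(val, alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
            alpha[t, u] = val
    # a final blank takes the path out of the last cell
    return -(alpha[T - 1, U - 1] + log_probs[T - 1, U - 1, blank])
```

With uniform log-probabilities the result reduces to counting monotone paths through the grid, exactly the alignment-matrix picture above.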


I have the same doubt. Do you understand it now? Why is the shape of labels (N, U-1)?

I guess U = 1 + len(labels): the extra symbol shouldn't be in the labels, but in the encoder logits.
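That reading matches the documented shapes. A quick sanity check (sizes made up for illustration):

```python
import numpy as np

# Made-up sizes for illustration
N, T, V = 2, 5, 10             # batch, input frames, vocab (incl. blank)
labels = np.array([[3, 1, 2],  # (N, U-1): reference labels only, no blank
                   [4, 4, 7]])
U = labels.shape[1] + 1        # +1 for the "empty" output row
log_probs_shape = (N, T, U, V) # the shape the loss expects
```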