Why No Softmax?
kingnobro opened this issue · comments
Ronghuan Wu commented
In `class DALLE(nn.Module)`, there is a member called `to_logits`:
```python
self.to_logits = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, self.total_tokens),
)
```
Why is there no Softmax after `nn.Linear`? I read the paper Attention Is All You Need, and there is a softmax function after the final linear layer.
If there is no softmax, the values in the logits might be very large. So in the function `generate_images`, when logits containing very large numbers are passed to the function `gumbel_sample`, the uniform noise cannot influence the sampling result.