Why No Softmax?
kingnobro opened this issue · comments
Ronghuan Wu commented
In `class DALLE(nn.Module)`, there is a member called `to_logits`:
```python
self.to_logits = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, self.total_tokens),
)
```
Why is there no Softmax after `nn.Linear`? I read the paper Attention Is All You Need, and there is a softmax function after the final linear layer.
If there is no softmax, the values in the logits might be very large. So in the function `generate_images`, when logits containing very large numbers are passed to the function `gumbel_sample`, the uniform noise cannot influence the sampling result.