Autoregressivity
sdpmas opened this issue
Samip Dahal commented
I had a question about Figure 2 and equation 3 from the paper. How does the last token of each chunk C_u being able to attend to the retrieved content E_u not break autoregressivity?
Phil Wang commented
so basically you have to make sure past tokens never see a future token. E_u is retrieved using only the tokens of chunk C_u, and the last token of C_u is the furthest-future token in that chunk, so it can safely attend to all of E_u without violating that rule
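To make the rule above concrete, here is a minimal index-only sketch (no attention math) of which retrieved set each token position may attend to. It assumes 0-indexed positions, a fixed chunk size, and that E_u depends on every token of C_u and nothing later. The function names are hypothetical and are not taken from the paper or any repository.

```python
def last_position_of_chunk(u, chunk_size):
    """Last token position of chunk C_u (0-indexed chunks and positions)."""
    return (u + 1) * chunk_size - 1


def shifted_chunk_for_position(i, chunk_size):
    """Shifted alignment: token i attends to E_u only once all of C_u has been seen.

    The last token of C_u is the first position allowed to attend to E_u.
    Returns None for the first chunk_size - 1 positions, which attend to nothing.
    """
    u = (i + 1) // chunk_size - 1
    return u if u >= 0 else None


def naive_chunk_for_position(i, chunk_size):
    """Unshifted alignment: token i attends to the E_u of its own chunk (leaks the future)."""
    return i // chunk_size


def is_causal(assign, seq_len, chunk_size):
    """True if no position attends to an E_u that was built from tokens after that position."""
    for i in range(seq_len):
        u = assign(i, chunk_size)
        if u is not None and last_position_of_chunk(u, chunk_size) > i:
            return False
    return True


if __name__ == "__main__":
    chunk_size, seq_len = 4, 12
    print([shifted_chunk_for_position(i, chunk_size) for i in range(8)])
    # prints [None, None, None, 0, 0, 0, 0, 1]
    print(is_causal(shifted_chunk_for_position, seq_len, chunk_size))  # prints True
    print(is_causal(naive_chunk_for_position, seq_len, chunk_size))    # prints False
```

With a chunk size of 4, position 3 (the last token of C_0) is the first one mapped to E_0, and positions 0 to 2 are mapped to nothing. The unshifted version fails the check at position 0, since E_0 already depends on positions 1 to 3.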
Phil Wang commented
@sdpmas the same trick was actually used here https://arxiv.org/abs/2110.13711 (i think deepmind probably read this paper and got some inspiration tbh)
Samip Dahal commented
I see, thanks a lot for the explanation!