lucidrains / RETRO-pytorch

Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch

Autoregressivity

sdpmas opened this issue

I had a question about Figure 2 and Equation 3 from the paper. How does letting the last token of each chunk C_u attend to the retrieved content E_u not break autoregressivity?

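so basically you have to make sure a token never sees anything that depends on a future token. E_u is retrieved using only the tokens of chunk C_u, and the last token of C_u is the furthest-future token in that chunk, so it can safely attend to all of E_u without violating that rule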
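To make the argument above concrete, here is a minimal sketch in plain Python (not code from this repo). It assumes 0-indexed token positions, a chunk size `m`, and that E_u is computed from the tokens of chunk C_u only. The helper names `shifted_chunk`, `naive_chunk` and `leaks_future` are hypothetical and exist only for illustration.

```python
from typing import Callable, Optional


def shifted_chunk(i: int, m: int) -> Optional[int]:
    """Chunk index u whose neighbours E_u token i attends to (assumed alignment).

    The assignment is shifted by one position, so the last token of chunk u
    (position (u + 1) * m - 1) is the first token allowed to see E_u.
    The first m - 1 tokens have no retrieved content to attend to.
    """
    u = (i + 1) // m - 1
    return u if u >= 0 else None


def naive_chunk(i: int, m: int) -> Optional[int]:
    """Unshifted alignment: every token attends to its own chunk's E_u."""
    return i // m


def leaks_future(assign: Callable[[int, int], Optional[int]],
                 seq_len: int, m: int) -> bool:
    """True if any token attends to an E_u that was built from a later token."""
    for i in range(seq_len):
        u = assign(i, m)
        if u is None:
            continue
        # E_u depends on every token of chunk u, up to its last position.
        last_query_token = (u + 1) * m - 1
        if last_query_token > i:
            return True
    return False
```

Under these assumptions, `leaks_future(shifted_chunk, 16, 4)` finds no violation, while `leaks_future(naive_chunk, 16, 4)` does: with the unshifted alignment, token 0 would attend to E_0, which was retrieved using tokens 0 through 3.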

@sdpmas the same trick was actually used here https://arxiv.org/abs/2110.13711 (i think deepmind probably read this paper and got some inspiration tbh)

I see, thanks a lot for the explanation!