About gpt.py lines 134-135
hufuzhipeng opened this issue
According to the Transformer paper, it seems that we could change
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
to
x = self.ln1(x + self.sa(x))
x = self.ln2(x + self.ffwd(x))
The results are similar either way, though.
Yes. In his video, he does go over why he's doing this. You can see his explanation here: https://youtu.be/kCc8FmEb1nY?si=VFtUYR-MjtrjR-Lw&t=5722
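As he puts it, there has been a "reshuffling" of the structure since the original paper: the layer norm is now commonly applied before the self-attention and feed-forward transformations (the "pre-norm" formulation) instead of after them.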
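To make the difference between the two orderings concrete, here is a minimal, dependency-free sketch. It is not the code from gpt.py: plain Python lists stand in for tensors, `layer_norm` omits the learned scale and shift, and `sublayer` is a hypothetical stand-in for `self.sa` or `self.ffwd`.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for self.sa or self.ffwd: any vector-to-vector map."""
    return [0.5 * v + 1.0 for v in x]

def pre_norm(x):
    # x = x + f(ln(x)): only the sublayer input is normalized,
    # so the residual path carries x through unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm(x):
    # x = ln(x + f(x)): the residual sum itself is normalized,
    # which is the ordering described in the original paper.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0, 8.0]
pre = pre_norm(x)
post = post_norm(x)
```

In this sketch the post-norm output always has zero mean and unit variance, whereas the pre-norm output keeps the scale of `x` because the skip connection bypasses the normalization entirely.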