karpathy / ng-video-lecture

About gpt.py line 134-135

hufuzhipeng opened this issue

According to the Transformer paper, it seems that we could change
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
to
x = self.ln1(x + self.sa(x))
x = self.ln2(x + self.ffwd(x))
The results are similar either way, though.

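Yes. He goes over why he does this in the video. You can see his explanation here: https://youtu.be/kCc8FmEb1nY?si=VFtUYR-MjtrjR-Lw&t=5722
As he puts it, there has been a "reshuffling" of the layer norms since the original paper: it is now more common to apply the layer norm before the sublayer (the "pre-norm" formulation) instead of after the residual addition.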
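
To make the two orderings concrete, here is a minimal pure-Python sketch (not the code from gpt.py). The `layer_norm` here leaves out the learned scale and shift, and `sublayer` is a hypothetical stand-in for `self.sa` or `self.ffwd`. Both are assumptions made only for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for self.sa or self.ffwd: any vector-to-vector map."""
    return [0.5 * v + 1.0 for v in x]

def pre_norm_step(x):
    # x = x + f(ln(x)): the ordering used in gpt.py
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_step(x):
    # x = ln(x + f(x)): the ordering from the original Transformer paper
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 10.0]
pre = pre_norm_step(x)    # residual path carries x through unnormalized
post = post_norm_step(x)  # output is always normalized
```

In the pre-norm ordering the input x is added back unchanged, so the residual path itself is never normalized. In the post-norm ordering every block output passes through the layer norm.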