About gpt.py line 134-135

Question

hufuzhipeng opened this issue a year ago

According to the Transformer paper, it seems that we can change
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
to
x = self.ln1(x + self.sa(x))
x = self.ln2(x + self.ffwd(x))
The results are similar, though.

Shafiq Jetha · Answer 1 · Mon Feb 05 2024 05:25:37 GMT+0800 (China Standard Time)

Yes. In his video, he does go over why he's doing this. You can see his explanation here: https://youtu.be/kCc8FmEb1nY?si=VFtUYR-MjtrjR-Lw&t=5722
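As he puts it, there has been a "reshuffling" of the layer norms since the original paper: it is now more common to apply layer norm before each sublayer (the pre-norm formulation) than after the residual addition.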
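
To make the two orderings concrete, here is a minimal, dependency-free sketch on plain Python lists. It is not the code from gpt.py: `layer_norm` here omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for `self.sa` or `self.ffwd`, used only to show where the normalization sits.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for self.sa or self.ffwd: any vector-to-vector map."""
    return [2.0 * v + 1.0 for v in x]


def pre_norm_step(x, sublayer):
    # Ordering used in gpt.py: normalize the sublayer's input,
    # and add the result to the untouched residual path.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_step(x, sublayer):
    # Ordering from the question: add the residual first,
    # then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_step(x, toy_sublayer)
post = post_norm_step(x, toy_sublayer)
```

In the post-norm variant the output of every step is normalized, so `post` has a mean of roughly zero. In the pre-norm variant only the sublayer's input is normalized, and the residual path carries `x` through unchanged, so `pre` keeps the offset of the original input.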