Attention Is All You Need

Question

subinium opened this issue 3 years ago · comments

Concept

Subin An · Answer 1 · Tue Jan 26 2021 15:17:30 GMT+0800 (China Standard Time)

Subin An · Answer 2 · Tue Jan 26 2021 15:41:53 GMT+0800 (China Standard Time)

The paper cites Convolutional Sequence to Sequence Learning as a reference when it mentions positional encoding.
- It is not stated directly in the paper, but according to some Stack Overflow posts it appears to use one-hot vectors.
So why does the Transformer use sin and cos for positional encoding? (sinusoidal signal)
- Section 3.5 of the paper says the following:
- We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
- In other words, the goal is that going from position A to position A+k is a linear transformation that depends only on the relative offset k, not on A itself.
- This link explains the proof and the answer well: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
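
To make the linearity claim concrete, here is a minimal sketch in plain Python. It assumes the encoding from section 3.5 (sin on even dimensions, cos on odd dimensions, base 10000). The function names `positional_encoding`, `shift`, and `max_abs_diff` and the tiny `d_model=8` are made up for illustration. `shift` applies a 2x2 rotation to each sin/cos pair, and its coefficients depend only on `k`, never on `pos`.

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal encoding: [sin(w_0*pos), cos(w_0*pos), sin(w_1*pos), ...]."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))  # frequency of the i-th pair
        pe.extend([math.sin(omega * pos), math.cos(omega * pos)])
    return pe

def shift(pe, k, d_model=8):
    """Map PE(pos) to PE(pos + k) with a linear map that only depends on k."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        # Angle-addition identities, written as a 2x2 rotation of (s, c).
        out.append(s * math.cos(omega * k) + c * math.sin(omega * k))
        out.append(c * math.cos(omega * k) - s * math.sin(omega * k))
    return out

def max_abs_diff(a, b):
    """Largest element-wise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

For any `pos`, `shift(positional_encoding(pos), k)` should match `positional_encoding(pos + k)` up to floating-point error, which is the property the quote describes.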
A later paper reports that adding pairwise relative position information, instead of absolute positions, gives better performance.
- Self-Attention with Relative Position Representations: this one is also from Google.
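
As a rough illustration of the pairwise idea, the sketch below adds a relative-position vector to each key before the dot product, with the distance `j - i` clipped to `[-k, k]`. It is simplified: the inputs are assumed to be already projected, and only the key-side term is shown (the paper also adds a similar term on the value side). The names `clip`, `relative_attention_weights`, and `rel_key_emb` are hypothetical.

```python
import math

def clip(x, k):
    """Clip a relative distance to the range [-k, k]."""
    return max(-k, min(k, x))

def relative_attention_weights(queries, keys, rel_key_emb, k):
    """
    queries, keys: lists of n vectors of length d (assumed already projected).
    rel_key_emb: list of 2k+1 vectors of length d, indexed by clip(j - i, k) + k.
    Returns an n x n matrix of softmax-normalized attention weights.
    """
    n, d = len(queries), len(queries[0])
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            a_ij = rel_key_emb[clip(j - i, k) + k]
            # e_ij = q_i . (k_j + a_ij) / sqrt(d)
            dot = sum(q * (kj + a) for q, kj, a in zip(queries[i], keys[j], a_ij))
            logits.append(dot / math.sqrt(d))
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Because the extra term is looked up by `j - i` rather than by `i` or `j` alone, two token pairs at the same distance share the same position signal, and distances beyond `k` are treated alike.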

Subin An · Answer 3 · Tue Jan 26 2021 15:44:42 GMT+0800 (China Standard Time)

Subin An · Answer 4 · Tue Jan 26 2021 16:03:39 GMT+0800 (China Standard Time)

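My understanding is that Multi-head Attention is used to improve performance by attending several times in parallel (like an ensemble).
A related NeurIPS 2019 paper:
- Are Sixteen Heads Really Better than One?
- TL;DR: the extra heads help, but in practice many of them are redundant, and pruning them well is the open challenge.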
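
A minimal sketch of the idea in plain Python: each head runs scaled dot-product attention on its own slice of the features, and the outputs are concatenated. The learned projections of the real model are left out for brevity, which is an assumption of this sketch. `head_mask` is a hypothetical argument that zeroes out a head's output to mimic pruning.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def multi_head_attention(x, num_heads, head_mask=None):
    """Self-attention where head h only sees feature slice h (no learned projections)."""
    d_head = len(x[0]) // num_heads
    if head_mask is None:
        head_mask = [1.0] * num_heads  # keep every head by default
    out = [[] for _ in x]
    for h in range(num_heads):
        sl = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_out = attention(sl, sl, sl)
        for row_out, row_head in zip(out, head_out):
            # A mask value of 0 removes this head's contribution entirely.
            row_out.extend(head_mask[h] * val for val in row_head)
    return out
```

Calling `multi_head_attention(x, 2, head_mask=[1.0, 0.0])` keeps the first head's output unchanged and zeroes the second. Comparing model quality with and without such a mask is roughly how one could check whether a head is redundant.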

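On a related note, the NeurIPS 2019 paper When Does Label Smoothing Help? from Professor Geoffrey Hinton's research team
- A good summary is ratsgo's blog post "Label Smoothing 이해하기" (Understanding Label Smoothing).
- It reportedly does not work well together with Knowledge Distillation: a teacher trained with label smoothing distills worse (reduced mutual information).
- It feels like there is something interesting to research here, but no idea has come to mind yet :(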
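
For reference, a minimal sketch of label smoothing itself in plain Python. The hard one-hot target is mixed with a uniform distribution over the classes, using `alpha = 0.1` to match the value the Transformer paper reports. The function names `smooth_labels` and `cross_entropy` are made up for illustration.

```python
import math

def smooth_labels(target, num_classes, alpha=0.1):
    """Return (1 - alpha) * one_hot(target) + alpha / num_classes."""
    return [
        (1 - alpha) * (1.0 if c == target else 0.0) + alpha / num_classes
        for c in range(num_classes)
    ]

def cross_entropy(logits, target_dist):
    """Cross entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp
    return -sum(t * (l - log_z) for t, l in zip(target_dist, logits))
```

With smoothed targets, a very confident prediction such as logits `[0, 0, 10, 0]` for class 2 is still penalized, so the model is discouraged from pushing the logits arbitrarily far apart.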