Use of scaled rotary in GPT-2 model
kaiokendev opened this issue
Actually, our project shares part of its codebase with another paper from our team, "A Length-Extrapolatable Transformer". That paper provides solid empirical studies of the properties and advantages of different positional embedding methods, including rotary, ALiBi, and the proposed XPOS. In our implementation, we simply share the GPT-2 LLM class with the XPOS paper's codebase. We did not run any ablation study on the choice of positional embedding during pre-training. We chose ALiBi because it does not introduce duplicated positional information between the local context and the retrieved long-past context, and it is also easy to implement and well suited to long-context models.
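For context on why ALiBi composes well with retrieved segments: it injects position only as a distance-proportional bias on the attention logits, so there is no absolute positional information baked into the token representations that could clash between the local and retrieved contexts. A minimal sketch (hypothetical helper names, not taken from this repo, and assuming a power-of-two head count):

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    # (simple closed form valid when num_heads is a power of two)
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * (i - j), added to the causal attention logits
    slopes = alibi_slopes(num_heads)                        # (H,)
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # (T, T) query-key distance
    return -slopes[:, None, None] * distance                # (H, T, T)

# Usage sketch: scores = (q @ k.transpose(-2, -1)) * scale + alibi_bias(H, T)
```

Because the bias depends only on relative distance within the current attention window, retrieved long-past chunks can be attended to without re-assigning them absolute positions.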
@Victorwz Thanks for the clarification!