Use of scaled rotary in GPT-2 model
kaiokendev opened this issue
Actually, our project shares part of its codebase with another paper from our team, "A Length-Extrapolatable Transformer". That paper provides solid empirical studies of the properties and advantages of different positional embedding methods, including rotary, ALiBi, and the proposed XPOS. In our implementation, we simply share the GPT-2 LLM class with the XPOS paper's codebase. We did not run any ablation study on the choice of positional embedding during pre-training. We chose ALiBi because it does not introduce duplicated positional information between the local context and the retrieved long-past context, and it is also easy to implement and well suited to long-context models.
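For context on why ALiBi composes well with retrieved segments: it injects position only as a distance-proportional bias on the attention logits, so there is no absolute positional information baked into the token representations that could clash between the local and retrieved contexts. A minimal sketch (hypothetical helper names, not taken from this repo, and assuming a power-of-two head count):

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    # (simple closed form valid when num_heads is a power of two)
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * (i - j), added to the causal attention logits
    slopes = alibi_slopes(num_heads)                        # (H,)
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # (T, T) query-key distance
    return -slopes[:, None, None] * distance                # (H, T, T)

# Usage sketch: scores = (q @ k.transpose(-2, -1)) * scale + alibi_bias(H, T)
```

Because the bias depends only on relative distance within the current attention window, retrieved long-past chunks can be attended to without re-assigning them absolute positions.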
@Victorwz Thanks for the clarification!