Victorwz / LongMem

Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory".

Home Page: https://arxiv.org/abs/2306.07174

Use of scaled rotary in GPT-2 model

kaiokendev opened this issue

Hello, thank you for the work on the paper.

I noticed that the revised GPT-2 model uses scaled rotary embeddings (XPos), here and here, but the paper makes no mention of scaled rotary (or rotary) embeddings being applied or tested. Was an ablation performed but not included in the paper?

Our project shares part of its codebase with another paper from our team, "A Length-Extrapolatable Transformer". That paper provides thorough empirical studies of the properties and advantages of different positional embedding methods, including rotary, ALiBi, and the proposed XPOS. In our implementation we simply reuse the GPT-2 LLM class from the XPOS paper. We did not perform any ablation study on the choice of positional embedding during the pre-training stage. We selected ALiBi because it does not introduce duplicated positional information between the local context and the retrieved long past context; it is also easy to implement and well suited to long-context models.
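
For readers unfamiliar with ALiBi, here is a minimal sketch (not taken from the LongMem codebase; the helper names are illustrative) of how ALiBi injects position purely as an additive, distance-based penalty on the attention scores. Because the penalty depends only on relative distance within the current window, no positional signal is baked into the keys or values themselves, which is why retrieved long-past keys do not carry duplicated positional information.

```python
import torch


def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes, as in the ALiBi paper
    # (assumes num_heads is a power of two for simplicity).
    start = 2 ** (-8 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j) for j <= i (causal attention).
    # The penalty depends only on the query-key distance, so positions
    # are never encoded into the token representations themselves.
    slopes = alibi_slopes(num_heads)                         # (H,)
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)    # (T, T)
    return -slopes[:, None, None] * distance.float()         # (H, T, T)


# Usage sketch: add the bias to the raw attention scores before softmax.
# attn_scores: (H, T, T) from q @ k.transpose(-2, -1) / sqrt(head_dim)
#   attn_scores = attn_scores + alibi_bias(num_heads, seq_len)
```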

@Victorwz Thanks for the clarification!