FranxYao / Long-Context-Data-Engineering

Implementation of paper Data Engineering for Scaling Language Models to 128K Context


Was the base frequency increased, or do you rely on position interpolation via scaling?

tgunter opened this issue

Hi!

In your paper you mention:

We do not make any significant change to model architecture other than adjusting the base of RoPE, as in Xiong et al. (2023).

However, it appears that the published yaofu/llama-2-7b-80k model relies on a RoPE scaling factor (as in the position interpolation paper of Chen et al., 2023) and leaves base_theta unchanged (unlike in Xiong et al., 2023).
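For concreteness, here is a minimal sketch of how the two styles typically show up in a Hugging Face LlamaConfig. The field names are from transformers' LlamaConfig; the specific numbers are illustrative assumptions, not values taken from the released checkpoint or the paper:

```python
from transformers import LlamaConfig

# Position interpolation (Chen et al., 2023): keep the default base theta
# and compress positions by a linear scaling factor.
pi_config = LlamaConfig(
    max_position_embeddings=81920,                     # ~80K context (assumed)
    rope_theta=10000.0,                                # base left unchanged
    rope_scaling={"type": "linear", "factor": 20.0},   # 4096 * 20 = 81920
)

# Adjusted base frequency (Xiong et al., 2023): raise rope_theta and
# use no scaling factor.
abf_config = LlamaConfig(
    max_position_embeddings=81920,
    rope_theta=5_000_000.0,                            # enlarged base (illustrative)
    rope_scaling=None,
)
```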

Is this correct? Does your result in fact rely on the position interpolation from Chen et al. and not on the base frequency change advocated by Xiong et al.?

Thanks!

I have the same confusion

Same confusion.
Actually, with a context window size of 64K/80K in continual pretraining, there's no need to adjust the base to achieve 100K window extrapolation.

Hi,

I just checked the code. Here is the story

But actually I tend to think it is not necessary to fuss over the base parameter. As long as it is larger than 128K, and as long as you continue pretraining your model on enough data, the performance, I would say, will be similar no matter what kind of base / interpolation you use; Linear / NTK / YaRN would all end up about the same.
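To make the comparison concrete, below is a rough sketch of how the recipes mentioned above act on the RoPE inverse frequencies. The dimensions, extension factor, and base values are assumptions for illustration only, not code from this repo:

```python
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a per-head dimension `dim`."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

dim = 128                      # per-head dimension of LLaMA-2 7B
scale = 80_000 / 4_096         # extension factor, ~19.5x (illustrative)

# (1) Adjusted base frequency (Xiong et al.): enlarge theta, positions unchanged.
inv_freq_abf = rope_inv_freq(dim, base=5_000_000.0)

# (2) Linear position interpolation (Chen et al.): keep theta, squeeze positions
# back into the original trained range.
inv_freq_pi = rope_inv_freq(dim, base=10_000.0)
positions = torch.arange(80_000).float() / scale

# (3) NTK-aware scaling: fold the factor into the base so low frequencies are
# interpolated while high frequencies stay close to the original.
ntk_base = 10_000.0 * scale ** (dim / (dim - 2))
inv_freq_ntk = rope_inv_freq(dim, base=ntk_base)
```

Whichever variant is used, continued pretraining on enough long data lets the model adapt to the modified frequencies, which is the point being made above.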