Was the base frequency increased, or do you rely on position interpolation via scaling?
tgunter opened this issue
Hi!
In your paper you mention:
We do not make any significant change to model architecture other than adjusting the base of RoPE, as in Xiong et al. (2023).
However, it appears that the published yaofu/llama-2-7b-80k model relies on a RoPE scaling factor (as in the position interpolation paper of Chen et al., 2023) and leaves base_theta unchanged (unlike Xiong et al., 2023).
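For reference, the relevant fields can be inspected roughly like this (a sketch using the standard Hugging Face LlamaConfig attributes, assuming nothing beyond the model id above):

```python
from transformers import AutoConfig

# Load the published config and look at the two RoPE-related fields.
cfg = AutoConfig.from_pretrained("yaofu/llama-2-7b-80k")
print(cfg.rope_theta)    # the RoPE base (base_theta)
print(cfg.rope_scaling)  # the position-interpolation scaling dict, if any
```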
Is this correct? Does your result in fact rely on the position interpolation from Chen et al. rather than the base frequency change advocated by Xiong et al.?
Thanks!
I have the same confusion
Same confusion.
Actually, with a context window size of 64K/80K in continual pretraining, there's no need to adjust the base to achieve 100K-window extrapolation.
Hi,
I just checked the code. Here is the story:
- I modified RoPE using the following code (sketched after this list):
- This code calls the `_set_cos_sin_cache` function in HF transformers==4.35.2, here: https://github.com/huggingface/transformers/blob/514de24abfd4416aeba6a6455ad5920f57f3567d/src/transformers/models/llama/modeling_llama.py#L175
- Within this function, the `base` argument is modified in https://github.com/huggingface/transformers/blob/514de24abfd4416aeba6a6455ad5920f57f3567d/src/transformers/models/llama/modeling_llama.py#L179
- Note that in the recent 4.38.2 version the function `_set_cos_sin_cache` is removed from HF (I don't know why)
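Roughly, the modification looks like the sketch below, against transformers==4.35.2. It is not the verbatim snippet; the helper name `reset_rope`, the 81920-token length, and the scaling factor value are illustrative, and it assumes the config uses the dynamic-NTK `rope_scaling` type so that `_set_cos_sin_cache` recomputes `base`.

```python
import torch
from transformers import AutoModelForCausalLM

def reset_rope(model, model_max_train_len, scaling_factor):
    """Reset the cos/sin cache of every layer's rotary embedding.

    With rope_scaling type "dynamic" in the config, the rotary module is the
    dynamic-NTK variant, so _set_cos_sin_cache recomputes `base` for the new
    sequence length (modeling_llama.py#L179 in 4.35.2).
    """
    for layer in model.model.layers:
        rotary = layer.self_attn.rotary_emb
        rotary.scaling_factor = scaling_factor
        rotary._set_cos_sin_cache(
            seq_len=model_max_train_len,
            device=rotary.inv_freq.device,
            dtype=torch.float32,
        )

model = AutoModelForCausalLM.from_pretrained("yaofu/llama-2-7b-80k")
reset_rope(model, model_max_train_len=81920, scaling_factor=8.0)
```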
But actually I tend to think it is not necessary to fuss over the `base` parameter -- as long as it is larger than 128K and you continue pretraining your model on enough data, the performance, I would say, will be similar no matter what kind of base / interpolation you use -- Linear / NTK / YaRN would all be about the same.
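For concreteness, these options map to the standard LlamaConfig fields roughly as follows (the values are arbitrary placeholders; YaRN is omitted since it is not a built-in `rope_scaling` type in this transformers version):

```python
from transformers import LlamaConfig

# Raise the RoPE base directly (the Xiong et al., 2023 style); value is a placeholder.
cfg_base = LlamaConfig(rope_theta=5_000_000.0)

# Linear position interpolation (Chen et al., 2023); factor is a placeholder.
cfg_linear = LlamaConfig(rope_scaling={"type": "linear", "factor": 8.0})

# Dynamic NTK scaling, which recomputes the base at long sequence lengths.
cfg_ntk = LlamaConfig(rope_scaling={"type": "dynamic", "factor": 8.0})
```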