Was the base frequency increased, or do you rely on position interpolation via scaling?
tgunter opened this issue
Hi!
In your paper you mention:
We do not make any significant change to model architecture other than adjusting the base of RoPE, as in Xiong et al. (2023).
However, it appears that the published yaofu/llama-2-7b-80k model relies on a RoPE scaling factor (as in the position interpolation paper of Chen et al., 2023) and leaves base_theta unchanged (unlike Xiong et al., 2023).
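For reference, the relevant fields can be inspected roughly like this (a sketch using the standard Hugging Face LlamaConfig attributes, assuming nothing beyond the model id above):

```python
from transformers import AutoConfig

# Load the published config and look at the two RoPE-related fields.
cfg = AutoConfig.from_pretrained("yaofu/llama-2-7b-80k")
print(cfg.rope_theta)    # the RoPE base (base_theta)
print(cfg.rope_scaling)  # the position-interpolation scaling dict, if any
```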
Is this correct? Does your result in fact rely on the position interpolation from Chen et al. rather than the base frequency change advocated by Xiong et al.?
Thanks!
I have the same confusion
Same confusion.
Actually, with a context window size of 64K/80K in continual pretraining, there's no need to adjust the base to achieve 100K-window extrapolation.
Hi,
I just checked the code. Here is the story:
- I modified RoPE using the following code (sketched after this list):
- This code calls the `_set_cos_sin_cache` function in HF transformers==4.35.2, here: https://github.com/huggingface/transformers/blob/514de24abfd4416aeba6a6455ad5920f57f3567d/src/transformers/models/llama/modeling_llama.py#L175
- Within this function, the `base` argument is modified in https://github.com/huggingface/transformers/blob/514de24abfd4416aeba6a6455ad5920f57f3567d/src/transformers/models/llama/modeling_llama.py#L179
- Note that in the recent 4.38.2 version the function `_set_cos_sin_cache` is removed from HF (I don't know why)
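Roughly, the modification looks like the sketch below, against transformers==4.35.2. It is not the verbatim snippet; the helper name `reset_rope`, the 81920-token length, and the scaling factor value are illustrative, and it assumes the config uses the dynamic-NTK `rope_scaling` type so that `_set_cos_sin_cache` recomputes `base`.

```python
import torch
from transformers import AutoModelForCausalLM

def reset_rope(model, model_max_train_len, scaling_factor):
    """Reset the cos/sin cache of every layer's rotary embedding.

    With rope_scaling type "dynamic" in the config, the rotary module is the
    dynamic-NTK variant, so _set_cos_sin_cache recomputes `base` for the new
    sequence length (modeling_llama.py#L179 in 4.35.2).
    """
    for layer in model.model.layers:
        rotary = layer.self_attn.rotary_emb
        rotary.scaling_factor = scaling_factor
        rotary._set_cos_sin_cache(
            seq_len=model_max_train_len,
            device=rotary.inv_freq.device,
            dtype=torch.float32,
        )

model = AutoModelForCausalLM.from_pretrained("yaofu/llama-2-7b-80k")
reset_rope(model, model_max_train_len=81920, scaling_factor=8.0)
```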
But actually I tend to think it is not necessary to fuss over the `base` parameter -- as long as it is larger than 128K and you continue pretraining your model on enough data, the performance, I would say, will be similar no matter what kind of base / interpolation you use -- Linear / NTK / YaRN would all be about the same.
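For concreteness, these options map to the standard LlamaConfig fields roughly as follows (the values are arbitrary placeholders; YaRN is omitted since it is not a built-in `rope_scaling` type in this transformers version):

```python
from transformers import LlamaConfig

# Raise the RoPE base directly (the Xiong et al., 2023 style); value is a placeholder.
cfg_base = LlamaConfig(rope_theta=5_000_000.0)

# Linear position interpolation (Chen et al., 2023); factor is a placeholder.
cfg_linear = LlamaConfig(rope_scaling={"type": "linear", "factor": 8.0})

# Dynamic NTK scaling, which recomputes the base at long sequence lengths.
cfg_ntk = LlamaConfig(rope_scaling={"type": "dynamic", "factor": 8.0})
```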