FranxYao / Long-Context-Data-Engineering

Implementation of paper Data Engineering for Scaling Language Models to 128K Context

Collapsed performance in short length (related to a bug in HF's LlamaDynamicNTKScalingRotaryEmbedding)

gaotianyu1350 opened this issue

Congrats on the great work. I noticed that with the current HF code, the model collapses at very short input lengths. For example, I tried a text completion task with top_p = 0.95, but the model (yaofu/llama-2-7b-80k) started to repeat (a minimal repro sketch follows the transcript):

You >>> We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training~(e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the \textit{quantity} and \textit{quality} of the data for continual pretraining: (1) for quantity,
Bot >>> and for quality, for qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual qual
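
For reference, a minimal sketch of the setup used above, assuming a recent transformers release (with accelerate installed for device_map); the prompt is truncated here, use the full abstract from the transcript:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "yaofu/llama-2-7b-80k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Short prompt (well under 4K tokens), which is exactly the regime that collapses.
prompt = "We study the continual pretraining recipe for scaling language models' context lengths to 128K, ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```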

This is because in the current HF implementation, the base is only changed from 10,000 to a new value when the current input length exceeds 4K; a sketch of that logic follows the next transcript. Since this model has been trained on longer lengths for a long time, it has "forgotten" the old base and hence collapses. To fix this, the base should simply be set to the same value used during training. A quick (but hacky) solution is to set seq_len to 80K here, and the generation looks okay now:

You >>> We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training~(e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the \textit{quantity} and \textit{quality} of the data for continual pretraining: (1) for quantity,
Bot >>> we treat the data mixture itself as continuous data and build a metric to assess its fidelity to the ground-truth data, providing confidence that mixes provide systematic, sample-to-sample information; (2) for quality, we evaluate the efficacy of 128K language model that utilizes the mix, which we call FiLM-128k on a series of downstream tasks; and (3) compare FiLM-128k
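
To make the failure mode concrete, here is a rough paraphrase of the relevant branch in HF's LlamaDynamicNTKScalingRotaryEmbedding (names and defaults simplified, not the verbatim source): the base is only rescaled once seq_len exceeds the original 4K window, so any shorter input silently falls back to base 10,000.

```python
import torch

def dynamic_ntk_inv_freq(seq_len, dim=128, base=10000.0,
                         max_position_embeddings=4096, scaling_factor=1.0):
    # Dynamic-NTK rule (paraphrased): rescale the RoPE base only when the
    # sequence is longer than the original training window.
    if seq_len > max_position_embeddings:
        base = base * (
            (scaling_factor * seq_len / max_position_embeddings)
            - (scaling_factor - 1)
        ) ** (dim / (dim - 2))
    # For seq_len <= 4096 the original base 10,000 is kept, which this
    # continually pretrained checkpoint no longer expects.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
```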

Since DynamicNTKScalingRotaryEmbedding is equivalent to changing the base, the easiest solution without touching the HF code is to set rope_theta (i.e., the base of the rotary embedding) in the config.json file to the same value used during training.
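
The same override can also be applied programmatically instead of editing config.json on disk. A minimal sketch, assuming a transformers version whose LlamaConfig exposes rope_theta, with the actual value left as a placeholder since it must match the training run:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "yaofu/llama-2-7b-80k"
config = AutoConfig.from_pretrained(model_name)
# Placeholder: set this to the rotary base actually used during continual pretraining.
config.rope_theta = ...
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
```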