HazyResearch / safari

Convolutions for Sequence Modeling

Non-causal implementation of language model for synthetic datasets

Karami-m opened this issue

Regarding the synthetic datasets: from the implementation, and as explained in the issue, the training loss is evaluated on all tokens while the test loss is evaluated only on the last token. My question is: what is the advantage of this autoregressive training strategy, which requires the model to be causal, over simply framing training as a classification problem, i.e. evaluating the training loss and accuracy only on the last token, so that
$p(y[\ldots, -1]) \simeq \mathrm{Hyena}(x)[\ldots, -1]$
With this training approach, the target is estimated from all the tokens in the sequence, so it seems the model is not required to be causal for the Associative Recall and induction head datasets. Is that correct?
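For concreteness, here is a minimal sketch of the two objectives being contrasted, in PyTorch. The `model` interface and function names are illustrative assumptions, not the repo's actual training code:

```python
import torch.nn.functional as F

# Assumed interface: `model` maps token ids (batch, seq_len) to
# logits (batch, seq_len, vocab_size).

def autoregressive_loss(model, x, y):
    """Causal LM objective: every position predicts the next token.
    Here y is x shifted left by one, i.e. y[:, t] == x[:, t + 1]."""
    logits = model(x)                                   # (B, L, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), y.reshape(-1)
    )

def last_token_classification_loss(model, x, y):
    """Classification framing from the question: only the final position
    contributes to the loss, so no causal mask is needed."""
    logits = model(x)[:, -1, :]                         # (B, V)
    return F.cross_entropy(logits, y[:, -1])
```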

Thanks for your quick response. Regarding this dataset: if the goal is to optimize the test accuracy reported in the paper, which is evaluated on the last token, it seems we don't need to train the model as a causal language model; we could simply design the training loss to match the test loss by evaluating only the last token. I understand the goal was to design a benchmark that evaluates the autoregressive (causal) behaviour of the LM, but on this dataset the causal training procedure seems to act as a regularizer: it forces the model to predict every next token, while the final goal is the accuracy on the last token only.

My intuition is that next-token prediction actually makes it easier for the model to learn the circuit it needs for associative recall, since there are more tokens it must predict correctly during training (roughly the second half of the tokens, after it has seen each example once). You can see this when the training accuracy jumps around the same time as the test accuracy: the model has learned a generalizable behavior, not just a shortcut.
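To make that intuition concrete, here is a hedged sketch of an associative-recall-style generator; the exact format (vocabulary, pair layout, sequence length) is an assumption and may differ from the repo's generator. Because keys repeat, every value token after a pair's first occurrence is predictable from the prefix, which is where the extra next-token supervision comes from:

```python
import random

def make_example(num_pairs=8, vocab_size=20, seed=0):
    """Illustrative associative-recall sequence: interleaved key/value
    pairs drawn from a small vocabulary, ending with a query key whose
    value the model must recall."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_pairs)
    mapping = {k: rng.randrange(vocab_size) for k in keys}
    seq = []
    # Keys repeat, so once a (key, value) pair has been seen, its value
    # token is predictable from the prefix -- roughly the second half
    # of positions carry learnable next-token signal.
    for _ in range(2 * num_pairs):
        k = rng.choice(keys)
        seq.extend([k, mapping[k]])
    query = rng.choice(keys)
    return seq + [query], mapping[query]  # (input tokens, last-token target)
```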

In any case, the synthetic is mostly useful insofar as it predicts downstream performance on language modeling. There are plenty of techniques that can solve associative recall (e.g., a plain Python function); the interesting bit is when we can use the synthetic to guess how well a layer will perform downstream.
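As an illustration of such a trivial solution, a few lines of Python solve the task exactly (assuming the interleaved key/value format sketched above):

```python
def associative_recall(tokens):
    """A plain Python 'solution' to associative recall: remember each
    key -> value pair and look up the final query key. The benchmark is
    not about whether the task is solvable, but whether a layer trained
    by gradient descent discovers the same circuit."""
    pairs = dict(zip(tokens[:-1:2], tokens[1:-1:2]))  # keys at even, values at odd positions
    return pairs[tokens[-1]]                          # last token is the query key
```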

Thanks for your clarification. For the Monarch Mixer you mentioned, did you use the causal form of M2 for the associative recall dataset?

I also have another question, about the induction head dataset: does the same reasoning (causality and next-token prediction) apply to it as well? I ask because, in contrast to associative recall, this dataset only requires recalling the token that follows a special token occurring in the middle of the sequence, rather than predicting all the tokens.
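For reference, here is a hedged sketch of what an induction-head-style example might look like; the special-token convention and positions are assumptions, not necessarily the repo's exact generator. Only the single token after the special token needs to be recalled:

```python
import random

SPECIAL = 0  # assumed reserved token id

def make_induction_example(seq_len=16, vocab_size=20, seed=0):
    """Illustrative induction-head sequence: a special token appears once
    mid-sequence, and the sequence ends with the same special token; the
    target is the token that immediately followed its first occurrence."""
    rng = random.Random(seed)
    seq = [rng.randrange(1, vocab_size) for _ in range(seq_len)]
    pos = rng.randrange(1, seq_len - 2)  # first occurrence, away from the edges
    seq[pos] = SPECIAL
    target = seq[pos + 1]                # the token to be recalled
    seq[-1] = SPECIAL                    # query: the special token reappears
    return seq, target
```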

For the M2-BERT synthetics, we ran a non-causal form of associative recall to fine-tune the architecture.

For the induction head, the same reasoning applies to what we implemented here, but you're right that the setup is a bit different.