mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

How does packing work for non-MPT models?

lorabit110 opened this issue · comments

I have noticed that llm-foundry has a data collator that packs short examples into one training instance to reduce the waste from padding. This relies on a 3D attention mask to ensure the result is equivalent to training on the instances separately. However, a 3D attention mask is not supported by HuggingFace Transformers' model implementations.

I wonder: when I use llm-foundry to train llama2, falcon, or mistral, does llm-foundry automatically replace the attention implementation to support a 3D attention mask, or is that something we need to implement ourselves?

Also, does the data collator work for SFT, where we have prompts and responses?

Your understanding is correct. We do not create a 3D attention mask for llama, mistral, etc. For MPT, you can control this behavior with the attn_uses_sequence_id argument. In practice it seems to be ok to train with packed sequences without the 3D mask, although I'd be happy to hear about any experience you have with and without the extra masking!

The data collator does work for SFT with prompt/response, yes.
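
For intuition, here is a minimal sketch of the kind of mask attn_uses_sequence_id implies (illustrative only, not llm-foundry's actual implementation; the function name is made up): tokens may only attend to earlier positions that carry the same sequence ID, so a packed row behaves like the separate examples it was built from.

```python
import torch

def sequence_id_attention_mask(sequence_id: torch.Tensor) -> torch.Tensor:
    """sequence_id: (batch, seq_len) ints, e.g. [[0, 0, 0, 1, 1]].

    Returns a (batch, 1, seq_len, seq_len) boolean mask that is True where
    attention is allowed: same packed example AND causal.
    """
    seq_len = sequence_id.size(-1)
    same_example = sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2)  # (b, s, s)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=sequence_id.device))
    return (same_example & causal).unsqueeze(1)  # extra dim broadcasts over heads

# Two packed examples of lengths 3 and 2 in one row:
print(sequence_id_attention_mask(torch.tensor([[0, 0, 0, 1, 1]]))[0, 0].int())
```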

Thanks for confirming. I guess in order to make it work for llama, we will need to implement something similar to what you have done for MPT to support attn_uses_sequence_id. I will share our findings and code once it's done.

Hi @dakinggg, a related question: when I set packing_ratio=2, the data loader fetches 2 x per_device_batch_size examples for each device batch. What happens to the overflow examples / tokens? Do we just drop them?

I have tried training without the 3D attention mask on very short examples, with a packing ratio > 100. The train loss didn't converge.

When you set packing ratio, yes it will concatenate samples together. As for convergence, there could be many things that affect this. We haven't seen any issues with sequence packing not converging. One thing is that if you don't change any other hyperparameters, sequence packing means you have (many) more tokens per batch than before, so a different learning rate may be required.
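
For intuition on the packing itself, here is a rough sketch of greedy bin packing over the packing_ratio * batch_size fetched examples (an illustration under the assumption that anything that doesn't fit into a bin is dropped and counted as waste; see llmfoundry/data/packing.py for the actual collator):

```python
from typing import List, Tuple

def greedy_pack(examples: List[List[int]], batch_size: int,
                max_seq_len: int) -> Tuple[List[List[int]], int]:
    """Pack token lists into batch_size bins of at most max_seq_len tokens each."""
    bins: List[List[int]] = [[] for _ in range(batch_size)]
    waste = 0
    for ex in sorted(examples, key=len, reverse=True):  # place largest first
        for b in bins:
            if len(b) + len(ex) <= max_seq_len:
                b.extend(ex)
                break
        else:
            waste += len(ex)  # didn't fit in any bin: dropped as waste
    return bins, waste

# packing_ratio=2: fetch 2 * batch_size short examples, pack into batch_size rows.
examples = [[1] * n for n in (60, 50, 40, 30, 20, 10)]
bins, waste = greedy_pack(examples, batch_size=3, max_seq_len=100)
print([len(b) for b in bins], waste)  # [100, 100, 10] 0
```

Note that each packed row now carries several real examples' worth of tokens, which is the tokens-per-batch effect mentioned above.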

You are right... I needed to train the model for a few more steps with a larger LR. Thanks for the help.

I have implemented the 3D attention mask for Mistral (basically by adding sequence_id support). It seems the 3D attention mask does help. Please see the plots below.

[Screenshot 2024-01-17 at 12 40 19 AM: loss plots with and without the 3D attention mask]
(The eval set is a subset of the train set, with packing ratio = 1, for infra validation purposes.)

Also, I had to switch to transformers@4.35.2 from 4.36.2 to make my patch work; otherwise, the loss wouldn't converge. I think it's related to a GitHub issue you created in the transformers repo.

I'd be curious to know if that model is actually better. It looks like your dataset was pretty easy and the model got it completely (0 loss). It seems reasonable that the packed, non-masked version would converge to not quite 0, since it can't predict the first token after an EOS. But the models may be pretty much identical otherwise.
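
To spell that point out with a toy example (purely illustrative, not real training code): in a packed row without the per-sequence mask, the position at each EOS is asked to predict the first token of the next, unrelated example, so that position's loss can't be driven to zero.

```python
# Toy packed row: two unrelated examples A and B concatenated with EOS tokens.
packed = ["A1", "A2", "<eos>", "B1", "B2", "<eos>"]

# Standard next-token prediction: position i predicts packed[i + 1].
for i in range(len(packed) - 1):
    note = ""
    if packed[i] == "<eos>":
        note = "  <- target belongs to a different packed example; unpredictable"
    print(f"position {i}: context ends at {packed[i]!r}, target {packed[i + 1]!r}{note}")
```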

[Screenshot 2024-01-17 at 11 40 28 AM: eval loss plots for the new experiment]
In this new experiment, the eval set doesn't overlap with the train set, so we test the model's generalization with and without the 3D attention mask. Indeed, the model is not actually better with the attention mask, but it does seem to train faster (it reaches the same eval/loss in fewer steps).

Closing, as I think the question has been answered. Feel free to open a new issue if you have more questions!