mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

How does packing work for non-MPT models?

lorabit110 opened this issue · comments

I have noticed that llm-foundry has a data collator that packs short examples into one training instance to reduce the waste from padding. This relies on a 3D attention mask to ensure the result is equivalent to training on the instances separately. However, a 3D attention mask is not supported by HuggingFace Transformers' model implementations.

I wonder: when I use llm-foundry to train llama2, falcon, or mistral, does llm-foundry automatically replace the attention implementation to support a 3D attention mask, or is that something we need to implement ourselves?

Also, does the data collator work for SFT, where we have prompts and responses?

Your understanding is correct. We do not create a 3D attention mask for llama, mistral, etc. For MPT, you can control this behavior with the attn_uses_sequence_id argument. In practice it seems to be ok to train with packed sequences without the 3D mask, although I'd be happy to hear about any experience you have with and without the extra masking!

The data collator does work for SFT with prompt/response, yes.
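
For intuition, here is a minimal sketch of the kind of mask attn_uses_sequence_id implies (illustrative only, not llm-foundry's actual implementation; the function name is made up): tokens may only attend to earlier positions that carry the same sequence ID, so a packed row behaves like the separate examples it was built from.

```python
import torch

def sequence_id_attention_mask(sequence_id: torch.Tensor) -> torch.Tensor:
    """sequence_id: (batch, seq_len) ints, e.g. [[0, 0, 0, 1, 1]].

    Returns a (batch, 1, seq_len, seq_len) boolean mask that is True where
    attention is allowed: same packed example AND causal.
    """
    seq_len = sequence_id.size(-1)
    same_example = sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2)  # (b, s, s)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=sequence_id.device))
    return (same_example & causal).unsqueeze(1)  # extra dim broadcasts over heads

# Two packed examples of lengths 3 and 2 in one row:
print(sequence_id_attention_mask(torch.tensor([[0, 0, 0, 1, 1]]))[0, 0].int())
```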

Thanks for confirming. I guess in order to make it work for llama, we will need to implement something similar to what you have done for MPT to support attn_uses_sequence_id. I will share our findings and code once it's done.

Hi @dakinggg, a related question: when I set packing_ratio=2, the data loader fetches 2 x per_device_batch_size examples for each device batch. What happens to the overflow examples / tokens? Do we just drop them?

I have tried training without the 3D attention mask on very short examples, with a packing ratio > 100. The train loss didn't converge.

When you set packing ratio, yes it will concatenate samples together. As for convergence, there could be many things that affect this. We haven't seen any issues with sequence packing not converging. One thing is that if you don't change any other hyperparameters, sequence packing means you have (many) more tokens per batch than before, so a different learning rate may be required.
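
For intuition on the packing itself, here is a rough sketch of greedy bin packing over the packing_ratio * batch_size fetched examples (an illustration under the assumption that anything that doesn't fit into a bin is dropped and counted as waste; see llmfoundry/data/packing.py for the actual collator):

```python
from typing import List, Tuple

def greedy_pack(examples: List[List[int]], batch_size: int,
                max_seq_len: int) -> Tuple[List[List[int]], int]:
    """Pack token lists into batch_size bins of at most max_seq_len tokens each."""
    bins: List[List[int]] = [[] for _ in range(batch_size)]
    waste = 0
    for ex in sorted(examples, key=len, reverse=True):  # place largest first
        for b in bins:
            if len(b) + len(ex) <= max_seq_len:
                b.extend(ex)
                break
        else:
            waste += len(ex)  # didn't fit in any bin: dropped as waste
    return bins, waste

# packing_ratio=2: fetch 2 * batch_size short examples, pack into batch_size rows.
examples = [[1] * n for n in (60, 50, 40, 30, 20, 10)]
bins, waste = greedy_pack(examples, batch_size=3, max_seq_len=100)
print([len(b) for b in bins], waste)  # [100, 100, 10] 0
```

Note that each packed row now carries several real examples' worth of tokens, which is the tokens-per-batch effect mentioned above.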

You are right... I needed to train the model for a few more steps with a larger LR. Thanks for the help.

I have implemented the 3D attention mask for Mistral (basically by adding sequence_id support). It seems the 3D attention mask does help. Please see the plots below.

[Screenshot 2024-01-17 at 12 40 19 AM: loss plots with and without the 3D attention mask]
(The eval set is a subset of the train set, with packing ratio = 1, for infra validation purposes.)

Also, I had to switch to transformers@4.35.2 from 4.36.2 to make my patch work; otherwise, the loss wouldn't converge. I think it's related to a GitHub issue you created in the transformers repo.

I'd be curious to know if that model is actually better. It looks like your dataset was pretty easy and the model got it completely (0 loss). It seems reasonable that the packed, non-masked version would converge to not quite 0, since it can't predict the first token after an EOS. But the models may be pretty much identical otherwise.
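
To spell that point out with a toy example (purely illustrative, not real training code): in a packed row without the per-sequence mask, the position at each EOS is asked to predict the first token of the next, unrelated example, so that position's loss can't be driven to zero.

```python
# Toy packed row: two unrelated examples A and B concatenated with EOS tokens.
packed = ["A1", "A2", "<eos>", "B1", "B2", "<eos>"]

# Standard next-token prediction: position i predicts packed[i + 1].
for i in range(len(packed) - 1):
    note = ""
    if packed[i] == "<eos>":
        note = "  <- target belongs to a different packed example; unpredictable"
    print(f"position {i}: context ends at {packed[i]!r}, target {packed[i + 1]!r}{note}")
```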

[Screenshot 2024-01-17 at 11 40 28 AM: eval loss plots for the new experiment]
In this new experiment, the eval set doesn't overlap with the train set, so we test the model's generalization with and without the 3D attention mask. Indeed, the model is not actually better with the attention mask, but it does seem to train faster (it reaches the same eval/loss in fewer steps).

Closing, as I think the question has been answered. Feel free to open a new issue if you have more questions!