DachengLi1 / LongChat

Official repository for LongChat and LongEval


How to use 3090 to train 16k model?

aresa7796 opened this issue · comments

I have 80k supervised examples but only a 3090 graphics card. How can I use a 3090 to train a 16k model?

While it can technically work, it's probably going to take too much VRAM and will be horribly slow.
Check out:
https://huggingface.co/docs/transformers/perf_train_gpu_one
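For reference, the memory-saving options in that guide mostly map onto a handful of TrainingArguments switches. Below is a minimal sketch, assuming a 24 GB card like the 3090; the output path and batch-size numbers are placeholders rather than tested values:

```python
# Minimal sketch of the single-GPU memory-saving switches from the HF guide.
# Placeholder values only; tune them to your actual model and data.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,    # smallest possible micro-batch on a 24 GB card
    gradient_accumulation_steps=16,   # recover a usable effective batch size
    gradient_checkpointing=True,      # recompute activations instead of storing them
    fp16=True,                        # mixed precision halves activation/gradient memory
    optim="adafactor",                # smaller optimizer state than AdamW
)
```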

@aresa7796 The current code assumes 8x A100 40GB. I think a 3090 should be able to run it after applying some systems techniques. If we can support training on 3090s (or other non-A100 GPUs), that would be really amazing. We just haven't had a chance to work on it yet; can you try it and share some feedback? Here are the steps I think should work:

(1) Use DeepSpeed ZeRO offloading as shared by @musabgultekin;
(2) Change the monkey patch from flash attention to xformers by calling this function. xformers provides a memory-efficient attention that supports non-A100 GPUs. I already have the monkey patch implemented. :P (A rough sketch follows this list.)
(3) Change bf16 to fp16 in the training command (and delete the tf32 argument as well).
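To make step (2) concrete, here is a rough sketch of what an xformers-based attention replacement can look like. The function names are illustrative, not the repo's, and the forward signature follows the LLaMA implementation in transformers around v4.28; the repo's monkey patch module is the maintained version:

```python
# Sketch of replacing LLaMA's attention with xformers memory_efficient_attention,
# which never materializes the full (seq x seq) attention matrix and therefore
# also runs on non-A100 GPUs. Names here are illustrative.
import torch
import xformers.ops as xops
import transformers.models.llama.modeling_llama as modeling_llama
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb


def xformers_forward(self, hidden_states, attention_mask=None, position_ids=None,
                     past_key_value=None, output_attentions=False, use_cache=False):
    bsz, q_len, _ = hidden_states.size()

    # Project and reshape to (batch, heads, seq, head_dim) for the rotary embedding.
    query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    key = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    value = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)

    cos, sin = self.rotary_emb(value, seq_len=q_len)
    query, key = apply_rotary_pos_emb(query, key, cos, sin, position_ids)

    # xformers expects (batch, seq, heads, head_dim); the lower-triangular bias
    # reproduces the causal mask. KV-cache handling is omitted for brevity.
    query, key, value = (t.transpose(1, 2) for t in (query, key, value))
    attn_output = xops.memory_efficient_attention(
        query, key, value, attn_bias=xops.LowerTriangularMask()
    )

    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    return self.o_proj(attn_output), None, past_key_value


def replace_llama_attn_with_xformers():
    # Call this before building the model so every layer picks up the new forward.
    modeling_llama.LlamaAttention.forward = xformers_forward
```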

Let me know if this works for you!

I am also wondering about this. For instance, a V100 might not be able to fit 2048 at all; if I use 1024 and apply condensed rotary embeddings with a ratio of 16, will that work? How well?

@lucasjinreal Condensing the rotary embeddings does not reduce memory; it only keeps the model's quality good at 16K.

@DachengLi1 What I mean is that a V100 cannot fit even a modest minimum length like 2048 in most cases.

@lucasjinreal I see, thanks! Condensing would be great; I believe it should work from, say, 1024 to 8192. But you will still need to fine-tune a bit on the longer length after condensing - perhaps you can resort to an A100 for that adaptation part?
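For anyone following along: the "condensing" being discussed divides position indices by a fixed ratio so that a longer context (e.g. 16K) reuses the 0-2048 positions the base model was pre-trained on, which is why it helps quality at long lengths but does not reduce memory. A minimal sketch of the idea, with an illustrative class name and defaults (the repo's condense monkey patch is the authoritative version):

```python
# Minimal sketch of condensed (interpolated) rotary embeddings: positions for a
# `ratio`-times-longer context are divided by `ratio` so they stay inside the
# pre-trained 0..max_position_embeddings range. Names and defaults are illustrative.
import torch


class CondensedRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, ratio=8, max_position_embeddings=2048, base=10000):
        super().__init__()
        self.ratio = ratio
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Pre-compute cos/sin for the extended length, with every position index
        # divided by `ratio` (this division is the "condensing" step).
        t = torch.arange(max_position_embeddings * ratio, dtype=torch.float32) / ratio
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :])
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :])

    def forward(self, x, seq_len):
        # Same interface as the stock LLaMA rotary embedding: return the cached
        # cos/sin tables truncated to the current sequence length.
        return (
            self.cos_cached[:, :, :seq_len].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len].to(dtype=x.dtype),
        )
```

Swapping something like this in for the stock rotary embedding, then fine-tuning briefly at the longer length, is the adaptation step mentioned above.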

@DachengLi1 Hi, I'd like to discuss a bit more: have you compared your method with ALiBi in terms of extrapolation ability?