DachengLi1 / LongChat

Official repository for LongChat and LongEval


How to use 3090 to train 16k model?

aresa7796 opened this issue · comments

I have 80k supervised examples but only a 3090 graphics card. How can I use a 3090 to train a 16k model?

While it can technically work, it's probably going to take too much VRAM and will be horribly slow.
Check out:
https://huggingface.co/docs/transformers/perf_train_gpu_one
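For reference, the memory-saving options in that guide mostly map onto a handful of TrainingArguments switches. Below is a minimal sketch, assuming a 24 GB card like the 3090; the output path and batch-size numbers are placeholders rather than tested values:

```python
# Minimal sketch of the single-GPU memory-saving switches from the HF guide.
# Placeholder values only; tune them to your actual model and data.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,    # smallest possible micro-batch on a 24 GB card
    gradient_accumulation_steps=16,   # recover a usable effective batch size
    gradient_checkpointing=True,      # recompute activations instead of storing them
    fp16=True,                        # mixed precision halves activation/gradient memory
    optim="adafactor",                # smaller optimizer state than AdamW
)
```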

@aresa7796 The current code assumes 8x A100 40GB. I think a 3090 should be able to run it after applying some systems techniques. If we can support training on 3090s (or other non-A100 GPUs), that would be really amazing. We just haven't had a chance to work on it yet; can you try it and share some feedback? Here are the steps I think should work:

(1) Use DeepSpeed ZeRO offloading as shared by @musabgultekin;
(2) Change the monkey patch from flash attention to xformers by calling this function. xformers provides a memory-efficient attention that supports non-A100 GPUs. I already have the monkey patch implemented. :P (A rough sketch follows this list.)
(3) Change bf16 to fp16 in the training command (and delete the tf32 argument as well).
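To make step (2) concrete, here is a rough sketch of what an xformers-based attention replacement can look like. The function names are illustrative, not the repo's, and the forward signature follows the LLaMA implementation in transformers around v4.28; the repo's monkey patch module is the maintained version:

```python
# Sketch of replacing LLaMA's attention with xformers memory_efficient_attention,
# which never materializes the full (seq x seq) attention matrix and therefore
# also runs on non-A100 GPUs. Names here are illustrative.
import torch
import xformers.ops as xops
import transformers.models.llama.modeling_llama as modeling_llama
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb


def xformers_forward(self, hidden_states, attention_mask=None, position_ids=None,
                     past_key_value=None, output_attentions=False, use_cache=False):
    bsz, q_len, _ = hidden_states.size()

    # Project and reshape to (batch, heads, seq, head_dim) for the rotary embedding.
    query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    key = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    value = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)

    cos, sin = self.rotary_emb(value, seq_len=q_len)
    query, key = apply_rotary_pos_emb(query, key, cos, sin, position_ids)

    # xformers expects (batch, seq, heads, head_dim); the lower-triangular bias
    # reproduces the causal mask. KV-cache handling is omitted for brevity.
    query, key, value = (t.transpose(1, 2) for t in (query, key, value))
    attn_output = xops.memory_efficient_attention(
        query, key, value, attn_bias=xops.LowerTriangularMask()
    )

    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    return self.o_proj(attn_output), None, past_key_value


def replace_llama_attn_with_xformers():
    # Call this before building the model so every layer picks up the new forward.
    modeling_llama.LlamaAttention.forward = xformers_forward
```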

Let me know if this works for you!

I am also wondering about this. For instance, a V100 might not be able to fit 2048 at all; if I use 1024 and apply condensed rotary embeddings with a ratio of 16, will that work? How well?

@lucasjinreal Condensing the rotary embeddings does not reduce memory; it only keeps the model's quality good at 16K.

@DachengLi1 What I mean is that a V100 cannot fit even a modest minimum length like 2048 in most cases.

@lucasjinreal I see, thanks! Condensing would be great; I believe it should work from, say, 1024 to 8192. But you will still need to fine-tune a bit on the longer length after condensing - perhaps you can resort to an A100 for that adaptation part?
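For anyone following along: the "condensing" being discussed divides position indices by a fixed ratio so that a longer context (e.g. 16K) reuses the 0-2048 positions the base model was pre-trained on, which is why it helps quality at long lengths but does not reduce memory. A minimal sketch of the idea, with an illustrative class name and defaults (the repo's condense monkey patch is the authoritative version):

```python
# Minimal sketch of condensed (interpolated) rotary embeddings: positions for a
# `ratio`-times-longer context are divided by `ratio` so they stay inside the
# pre-trained 0..max_position_embeddings range. Names and defaults are illustrative.
import torch


class CondensedRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, ratio=8, max_position_embeddings=2048, base=10000):
        super().__init__()
        self.ratio = ratio
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Pre-compute cos/sin for the extended length, with every position index
        # divided by `ratio` (this division is the "condensing" step).
        t = torch.arange(max_position_embeddings * ratio, dtype=torch.float32) / ratio
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :])
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :])

    def forward(self, x, seq_len):
        # Same interface as the stock LLaMA rotary embedding: return the cached
        # cos/sin tables truncated to the current sequence length.
        return (
            self.cos_cached[:, :, :seq_len].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len].to(dtype=x.dtype),
        )
```

Swapping something like this in for the stock rotary embedding, then fine-tuning briefly at the longer length, is the adaptation step mentioned above.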

@DachengLi1 Hi, I'd like to discuss a bit more: have you compared your method with ALiBi in terms of extrapolation ability?