Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

Home Page: https://lightning.ai


Why is FSDPStrategy so slow when I use multiple machines?

Graduo opened this issue · comments

Hello,
I was trying to train a 1.5B LLaMA model, but I observed an unexpected slowdown when using the FSDP strategy across two machines.
```
FLOPs not found for 'NVIDIA H800'
Measured TFLOPs: 2539.13
Epoch 1 | iter 16 step 1 | loss train: 8.515, val: n/a | iter time: 26133.73 ms (step) remaining time: 909 days, 3:20:37
Epoch 1 | iter 32 step 2 | loss train: 8.509, val: n/a | iter time: 26446.05 ms (step) remaining time: 635 days, 12:08:15
Epoch 1 | iter 48 step 3 | loss train: 8.491, val: n/a | iter time: 26204.95 ms (step) remaining time: 543 days, 2:07:38
Epoch 1 | iter 64 step 4 | loss train: 8.472, val: n/a | iter time: 26227.60 ms (step) remaining time: 496 days, 22:22:41
Epoch 1 | iter 80 step 5 | loss train: 8.492, val: n/a | iter time: 26297.45 ms (step) remaining time: 469 days, 9:35:18
Epoch 1 | iter 96 step 6 | loss train: 8.395, val: n/a | iter time: 25975.68 ms (step) remaining time: 450 days, 10:46:30
Epoch 1 | iter 112 step 7 | loss train: 8.383, val: n/a | iter time: 26152.08 ms (step) remaining time: 437 days, 4:40:59
Epoch 1 | iter 128 step 8 | loss train: 8.314, val: n/a | iter time: 26192.78 ms (step) remaining time: 427 days, 22:27:04
Epoch 1 | iter 144 step 9 | loss train: 8.411, val: n/a | iter time: 26267.13 ms (step) remaining time: 420 days, 6:28:56
```
When I train on a single machine, the iter time is around 700 ms.
Do you have any idea what the reason might be and how I can fix it? Thank you!

Hi, can you post the CLI args or code you are using?
Also, is this with two machines and 8 GPUs per machine?

Just to confirm: are you running the pretraining command?

Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174

We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take this variable out of the equation.
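
For reference, here is a minimal sketch of taking compilation out, assuming the referenced line wraps the model in `torch.compile` (the actual code in `litgpt/pretrain.py` may differ):

```python
# Minimal sketch, not the actual litgpt/pretrain.py code: guard the compile call
# behind a flag so it can be skipped while debugging the multi-node slowdown.
import torch
import torch.nn as nn


def setup_model(model: nn.Module, compile_model: bool = False) -> nn.Module:
    if compile_model:
        # This is the kind of call being suggested to comment out.
        model = torch.compile(model)
    return model
```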

> Hi, can you post the CLI args or code you are using? Also, is this with two machines and 8 GPUs per machine?

Hi, thanks for your prompt reply! Yes, I use two machines with 8 GPUs per machine. I use args like:

`fabric run --node-rank=0 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml`

`fabric run --node-rank=1 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml`

I initially tried `litgpt run`, but it did not work, so based on the suggestions I switched to `fabric run`. The strategy code is:

`strategy = FSDPStrategy(auto_wrap_policy={Block}, state_dict_type="full", sharding_strategy="HYBRID_SHARD")`
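
For context, a self-contained sketch of how that strategy would typically be wired up with Lightning Fabric; the imports are real, but the surrounding `Fabric` setup is an illustrative assumption rather than the exact pretraining script:

```python
# Illustrative sketch of the FSDP setup quoted above, not the exact litgpt script.
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

from litgpt.model import Block  # transformer block used as the FSDP wrapping unit

# HYBRID_SHARD shards parameters within each node and replicates them across
# nodes, so the cross-node traffic is gradient all-reduce rather than the
# parameter all-gathers that FULL_SHARD performs across all 16 GPUs.
strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    state_dict_type="full",
    sharding_strategy="HYBRID_SHARD",
)

# When the script is started with `fabric run`, the processes are spawned by
# the CLI and this Fabric object picks up the cluster environment.
fabric = Fabric(accelerator="cuda", devices=8, num_nodes=2, strategy=strategy)
```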

> Just to confirm: are you running the pretraining command?
>
> Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174
>
> We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take this variable out of the equation.

Yeah, I am running the pretraining command.