Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

Home Page: https://lightning.ai


Why is FSDPStrategy so slow when I use multiple machines?

Graduo opened this issue · comments

Hello,
I was trying to train a 1.5B LLaMA model, but I observed an unexpected slowdown when using the FSDP strategy across two machines.
```
FLOPs not found for 'NVIDIA H800'
Measured TFLOPs: 2539.13
Epoch 1 | iter 16 step 1 | loss train: 8.515, val: n/a | iter time: 26133.73 ms (step) remaining time: 909 days, 3:20:37
Epoch 1 | iter 32 step 2 | loss train: 8.509, val: n/a | iter time: 26446.05 ms (step) remaining time: 635 days, 12:08:15
Epoch 1 | iter 48 step 3 | loss train: 8.491, val: n/a | iter time: 26204.95 ms (step) remaining time: 543 days, 2:07:38
Epoch 1 | iter 64 step 4 | loss train: 8.472, val: n/a | iter time: 26227.60 ms (step) remaining time: 496 days, 22:22:41
Epoch 1 | iter 80 step 5 | loss train: 8.492, val: n/a | iter time: 26297.45 ms (step) remaining time: 469 days, 9:35:18
Epoch 1 | iter 96 step 6 | loss train: 8.395, val: n/a | iter time: 25975.68 ms (step) remaining time: 450 days, 10:46:30
Epoch 1 | iter 112 step 7 | loss train: 8.383, val: n/a | iter time: 26152.08 ms (step) remaining time: 437 days, 4:40:59
Epoch 1 | iter 128 step 8 | loss train: 8.314, val: n/a | iter time: 26192.78 ms (step) remaining time: 427 days, 22:27:04
Epoch 1 | iter 144 step 9 | loss train: 8.411, val: n/a | iter time: 26267.13 ms (step) remaining time: 420 days, 6:28:56
```
When I train on a single machine, the iter time is around 700 ms.
Do you have any idea what the reason might be and how I can fix it? Thank you!

Hi, can you post the CLI args or code you are using?
Also, is this with two machines and 8 GPUs per machine?

Just to confirm: are you running the pretraining command?

Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174

We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take this variable out of the equation.
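
For reference, here is a minimal sketch of taking compilation out, assuming the referenced line wraps the model in `torch.compile` (the actual code in `litgpt/pretrain.py` may differ):

```python
# Minimal sketch, not the actual litgpt/pretrain.py code: guard the compile call
# behind a flag so it can be skipped while debugging the multi-node slowdown.
import torch
import torch.nn as nn


def setup_model(model: nn.Module, compile_model: bool = False) -> nn.Module:
    if compile_model:
        # This is the kind of call being suggested to comment out.
        model = torch.compile(model)
    return model
```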

> Hi, can you post the CLI args or code you are using? Also, is this with two machines and 8 GPUs per machine?

Hi, thanks for your prompt reply! Yes, I use two machines with 8 GPUs per machine. I use args like:

`fabric run --node-rank=0 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml`

`fabric run --node-rank=1 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml`

I initially tried `litgpt run`, but it did not work, so based on the suggestions I switched to `fabric run`. The strategy code is:

`strategy = FSDPStrategy(auto_wrap_policy={Block}, state_dict_type="full", sharding_strategy="HYBRID_SHARD")`
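
For context, a self-contained sketch of how that strategy would typically be wired up with Lightning Fabric; the imports are real, but the surrounding `Fabric` setup is an illustrative assumption rather than the exact pretraining script:

```python
# Illustrative sketch of the FSDP setup quoted above, not the exact litgpt script.
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

from litgpt.model import Block  # transformer block used as the FSDP wrapping unit

# HYBRID_SHARD shards parameters within each node and replicates them across
# nodes, so the cross-node traffic is gradient all-reduce rather than the
# parameter all-gathers that FULL_SHARD performs across all 16 GPUs.
strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    state_dict_type="full",
    sharding_strategy="HYBRID_SHARD",
)

# When the script is started with `fabric run`, the processes are spawned by
# the CLI and this Fabric object picks up the cluster environment.
fabric = Fabric(accelerator="cuda", devices=8, num_nodes=2, strategy=strategy)
```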

> Just to confirm: are you running the pretraining command?
>
> Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174
>
> We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take this variable out of the equation.

Yeah, I am running the pretraining command.