FSDP returns different loss value with zero stage 2 and 3

Question

FSDP returns different loss values with ZeRO stage 2 and stage 3.

dongsungkim opened this issue 2 years ago · comments

How to reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nnodes=1 --nproc_per_node=2 ./tests/torch/nn/parallel/data_parallel/test_fsdp.py --zero-stage 2
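
To pin down where the runs start to disagree, one option is to log the per-step loss from each run and compare the two sequences with a tolerance. This is a minimal sketch, assuming the losses have already been collected into plain Python lists. The function name first_divergence and the sample values are hypothetical and do not come from the test script.

```python
import math


def first_divergence(losses_a, losses_b, rel_tol=1e-5, abs_tol=1e-8):
    """Return the index of the first step where the two loss sequences
    differ beyond tolerance, or None if every compared step agrees."""
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Hypothetical placeholder values, not measured results.
stage2_losses = [2.30, 2.10, 1.95]
stage3_losses = [2.30, 2.10, 1.80]

print(first_divergence(stage2_losses, stage3_losses))
```

The tolerances are arbitrary defaults. Small differences from floating-point reduction order can be expected between configurations, so they may need loosening before a mismatch is treated as a real bug.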

Environment

OS : Ubuntu 18.04
Python version : 3.7
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

dongsungkim · Answer 1 · Mon Oct 17 2022 03:51:28 GMT+0800 (China Standard Time)

There is no optimiser implementation in oslo/torch/nn/parallel/data_parallel/data_parallel.py yet.
One will be added for ZeRO stages 2 and 3.

In addition, cpu_offload in the FSDP code needs to be checked.