Sequence Parallel is incompatible with Rotary Positional Embedding
anogkongda opened this issue
I would like to finetune LLaMA 2 on long-sequence data (32K tokens or longer).
I followed the example below for sequence parallelism:
Sadly, the LM loss becomes NaN if I use rotary positional embedding.
When I disable rotary positional embedding, the loss is fine, even though all other parameters/arguments are the same as before.
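For reference, here is a rough sketch of how one could register forward hooks to find which module first produces non-finite values (the `register_nan_hooks` helper below is purely illustrative and not part of the training framework):

```python
import torch

def register_nan_hooks(model):
    """Attach forward hooks that report the first module emitting non-finite outputs.

    Illustrative debugging helper only; not part of the training framework.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite values in output of module: {name}")
                    break
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    return handles  # call handle.remove() on each to detach the hooks later
```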
After testing, I found the following:
- Reducing the model size (e.g., the original 32-layer LLaMA 7B reduced to 16 layers) prevents the loss from becoming NaN.
- Switching from BF16 to FP16 also prevents the loss from becoming NaN.
- When the loss becomes NaN, there is no protection mechanism, which causes all model parameters to turn into NaN (a minimal guard sketch follows this list).
- When Sequence Parallel is enabled, the BF16 optimizer might overflow under certain circumstances, potentially due to computational errors.
- I am still observing how the loss evolves in FP16 training.
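Regarding the missing protection mechanism, below is a minimal sketch of a guard that skips the parameter update when the loss or gradients become non-finite. It assumes a plain PyTorch training loop, not the framework's fused BF16 optimizer:

```python
import torch

def guarded_step(loss, model, optimizer, max_grad_norm=1.0):
    """Skip the parameter update instead of corrupting weights when loss/grads are non-finite."""
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        print("non-finite loss detected; skipping this step")
        return False

    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):
        optimizer.zero_grad(set_to_none=True)
        print("non-finite gradient norm detected; skipping this step")
        return False

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```

Something along these lines would at least keep a single bad step from turning every parameter into NaN.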
Hi @anogkongda, I also encountered the NaN issue and resolved it with #399. Could you try it and see whether it solves your problem?
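For context, one common mitigation for RoPE precision problems under BF16 is to compute the rotation in FP32 and cast back afterwards. The sketch below only illustrates that general idea and is not necessarily what #399 does:

```python
import torch

def apply_rope_fp32(x, cos, sin):
    """Apply rotary position embedding with the rotation computed in FP32.

    x:        (..., seq_len, dim) activations, possibly BF16
    cos, sin: (seq_len, dim) precomputed tables with each frequency repeated
              across both halves of the last dimension
    Illustrative sketch only; tensor layouts differ across implementations.
    """
    orig_dtype = x.dtype
    x32 = x.float()
    half = x32.shape[-1] // 2
    x1, x2 = x32[..., :half], x32[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)           # "rotate half" trick
    out = x32 * cos.float() + rotated * sin.float()  # rotation done in FP32
    return out.to(orig_dtype)                        # cast back to the model dtype
```

Whatever the exact fix, #399 itself is the authoritative change to check.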
Thank you, I will try this and report my results ASAP.