microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[BUG] grad_scale_func is not working when using Float16OptimizerWithFloat16Params, causing slow loss drops in fp16

liaosnow opened this issue · comments

Hi, I've found a problem: grad_scale_func is not applied when using Float16OptimizerWithFloat16Params.
Line 269 of Megatron-DeepSpeed/megatron/core/pipeline_parallel/schedules.py never executes:
"output_tensor[0] = config.grad_scale_func(output_tensor[0])"
because config.grad_scale_func is None at that point.
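
For context, the guard around that line looks roughly like this (a paraphrased sketch of backward_step; names follow upstream Megatron-LM, and the exact Megatron-DeepSpeed code may differ slightly):

```python
import torch

def backward_step_sketch(output_tensor, output_tensor_grad, config):
    # Paraphrased sketch of backward_step() in
    # megatron/core/pipeline_parallel/schedules.py.
    # On the last pipeline stage output_tensor_grad[0] is None, so the loss
    # should be multiplied by the fp16 loss scale here. With
    # config.grad_scale_func = None the scaling is silently skipped, the
    # unscaled fp16 loss goes into backward(), and small gradients underflow,
    # which shows up as the slow loss drop described above.
    if output_tensor_grad[0] is None and config.grad_scale_func is not None:
        output_tensor[0] = config.grad_scale_func(output_tensor[0])
    torch.autograd.backward(output_tensor[0],
                            grad_tensors=output_tensor_grad[0])
```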

At line 1144 of Megatron-DeepSpeed/megatron/training.py,
"config.grad_scale_func = optimizer.scale_loss" does execute, but "config = get_model_config(model)" does not appear to return the config object that schedules.py later reads.

Can you verify and fix this?

Same problem! Can you fix it? @liaosnow

Pass "config" in as a function parameter instead of calling "config = get_model_config(model)" inside the function.
This works around the issue.
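
A minimal sketch of that workaround (the signature is illustrative, not the exact upstream one):

```python
def train_step(forward_step_func, data_iterator, model, optimizer,
               opt_param_scheduler, config):
    # `config` now arrives from the caller, which already ran:
    #     config = get_model_config(model)
    #     config.grad_scale_func = optimizer.scale_loss
    # so the schedule sees the fully wired object and fp16 loss scaling runs.
    # The internal `config = get_model_config(model)` lookup is removed.
    ...
```

The caller must pass the same config object on which grad_scale_func was assigned, rather than letting the function re-derive it from the model.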