microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[BUG] grad_scale_func is not working when using Float16OptimizerWithFloat16Params, causing slow loss drops in fp16

liaosnow opened this issue · comments

Hi, I've found a problem: grad_scale_func is not applied when using Float16OptimizerWithFloat16Params.
Line 269 of Megatron-DeepSpeed/megatron/core/pipeline_parallel/schedules.py never executes:
"output_tensor[0] = config.grad_scale_func(output_tensor[0])"
because config.grad_scale_func is None at that point.
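
For context, the guard around that line looks roughly like this (a paraphrased sketch of backward_step; names follow upstream Megatron-LM, and the exact Megatron-DeepSpeed code may differ slightly):

```python
import torch

def backward_step_sketch(output_tensor, output_tensor_grad, config):
    # Paraphrased sketch of backward_step() in
    # megatron/core/pipeline_parallel/schedules.py.
    # On the last pipeline stage output_tensor_grad[0] is None, so the loss
    # should be multiplied by the fp16 loss scale here. With
    # config.grad_scale_func = None the scaling is silently skipped, the
    # unscaled fp16 loss goes into backward(), and small gradients underflow,
    # which shows up as the slow loss drop described above.
    if output_tensor_grad[0] is None and config.grad_scale_func is not None:
        output_tensor[0] = config.grad_scale_func(output_tensor[0])
    torch.autograd.backward(output_tensor[0],
                            grad_tensors=output_tensor_grad[0])
```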

At line 1144 of Megatron-DeepSpeed/megatron/training.py,
"config.grad_scale_func = optimizer.scale_loss" does execute, but "config = get_model_config(model)" does not appear to return the config object that schedules.py later reads.

Can you verify and fix this?

Same problem! Can you fix it? @liaosnow

Pass "config" in as a function parameter instead of calling "config = get_model_config(model)" inside the function.
This works around the issue.
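
A minimal sketch of that workaround (the signature is illustrative, not the exact upstream one):

```python
def train_step(forward_step_func, data_iterator, model, optimizer,
               opt_param_scheduler, config):
    # `config` now arrives from the caller, which already ran:
    #     config = get_model_config(model)
    #     config.grad_scale_func = optimizer.scale_loss
    # so the schedule sees the fully wired object and fp16 loss scaling runs.
    # The internal `config = get_model_config(model)` lookup is removed.
    ...
```

The caller must pass the same config object on which grad_scale_func was assigned, rather than letting the function re-derive it from the model.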