deepseek-ai / DeepSeek-Coder-V2

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence


Cannot fine-tune deepseek-coder-v2-lite via modeling_deepseek.py

chencyudel opened this issue

You may have a bug in dtype handling, and as a result the model cannot be fine-tuned via DeepSpeed (bf16 mixed precision):

File "/deepseek_v2/modeling_deepseek.py", line 1252, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/deepseek_v2/modeling_deepseek.py", line 953, in forward
q = self.q_proj(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
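
For context, the failure is not specific to DeepSeek's code: any bf16 Linear layer that receives a float32 activation fails in the same way. A minimal sketch in plain PyTorch (nothing here comes from modeling_deepseek.py):

import torch
import torch.nn as nn

# A bf16 Linear fed a float32 input reproduces the same class of dtype mismatch
# (the exact error message depends on the device/backend).
linear = nn.Linear(8, 8).to(torch.bfloat16)
x = torch.randn(2, 8, dtype=torch.float32)
try:
    linear(x)
except RuntimeError as e:
    print(e)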

@lihaoling I have found the reason: it is a problem in the forward method of the DeepseekV2MoE module. An easy, though maybe not optimal, fix is to cast the output back to the input dtype at the end of DeepseekV2MoE's forward method.
For example:

def forward(self, hidden_states):
    # save the input dtype before the MoE computation
    input_dtype = hidden_states.dtype
    identity = hidden_states
    orig_shape = hidden_states.shape
    topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
    hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
    flat_topk_idx = topk_idx.view(-1)
    if self.training:
        hidden_states = hidden_states.repeat_interleave(
            self.num_experts_per_tok, dim=0
        )
        y = torch.empty_like(hidden_states)
        for i, expert in enumerate(self.experts):
            y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
        y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
        y = y.view(*orig_shape)
        y = AddAuxiliaryLoss.apply(y, aux_loss)
    else:
        y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
    if self.config.n_shared_experts is not None:
        y = y + self.shared_experts(identity)
    # cast back so the output dtype matches the input dtype after the MoE forward
    return y.to(input_dtype)
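
If editing modeling_deepseek.py is inconvenient, the same cast can also be applied from the outside by wrapping each MoE block's forward at load time. A hedged sketch (the helper name patch_moe_output_dtype is made up; only the DeepseekV2MoE class name comes from the model code):

def patch_moe_output_dtype(model):
    # Wrap every DeepseekV2MoE forward so its output is cast back to the
    # dtype of the incoming hidden_states, without touching modeling_deepseek.py.
    for module in model.modules():
        if type(module).__name__ == "DeepseekV2MoE":
            orig_forward = module.forward
            def wrapped(hidden_states, _orig=orig_forward):
                return _orig(hidden_states).to(hidden_states.dtype)
            module.forward = wrapped
    return model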

@chencyudel Thanks! I have it running successfully.

I've also encountered this problem and solved it with the workaround above. I'm wondering if there is a more elegant way to solve it?

I guess this is due to mixed-precision training; for example, the Switch Transformers paper uses float32 precision in the router function and bfloat16 precision everywhere else. I'm not sure there is a more elegant way to avoid such a dtype conversion, though.
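
A rough sketch of that recipe (not the actual DeepSeek gate; the class and parameter names here are illustrative): keep only the routing computation in float32 and cast the routing weights back to the activation dtype before they multiply the bf16 expert outputs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Fp32Router(nn.Module):
    # Illustrative router in the spirit of Switch Transformers: the softmax
    # runs in float32 for stability, and the result is cast back to the
    # activation dtype (e.g. bf16) so downstream matmuls keep a single dtype.
    def __init__(self, hidden_size, n_experts):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_experts, hidden_size))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, hidden_states):
        logits = F.linear(hidden_states.float(), self.weight.float())
        probs = logits.softmax(dim=-1)        # float32 routing probabilities
        return probs.to(hidden_states.dtype)  # cast back before mixing experts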

Thanks a lot! I have it running successfully, but it seems quite slow. I used DeepSpeed ZeRO stage 3 for distributed training; has anyone encountered a similar situation?

Regarding the float32 router point above: when I load the model in bfloat16 for inference, the error does not appear. It seems that during the inference forward pass, the fp32 activations in the router function are cast back to bf16 automatically.
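
For reference, a sketch of such an inference-only bf16 load using the standard transformers API (the checkpoint name below is an assumption; substitute the one you actually use):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name for illustration; adjust to the actual model path.
model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # all weights, including the gate, load in bf16
    trust_remote_code=True,
)
model.eval()  # inference path; as noted above, the dtype error does not appear here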