microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[BUG] Shape mismatch error when splitting mixed_x_layer

RockMiin opened this issue

I am pretraining GPT and hit an error in the split_tensor function in megatron/model/transformer.py. The function is documented as transforming a [sq, b, nkv, (nq // nkv + 2), hn] tensor into three [sq, b, np, hn] tensors. When reshaping the query_layer, I believe the code should use mixed_x_layer.shape[:-2] instead of mixed_x_layer.shape[:-1]:

def split_tensor(self, mixed_x_layer):
    # mixed_x_layer: [sq, b, nkv, (nq // nkv + 2), hn].
    # The first (nq // nkv) entries of the fourth dimension are query groups;
    # the last two are the key and the value.
    query_layer = mixed_x_layer[:, :, :, :-2, :].reshape(
        mixed_x_layer.shape[:-2] + (-1, self.hidden_size_per_attention_head))
    key_layer = mixed_x_layer[:, :, :, -2, :]
    value_layer = mixed_x_layer[:, :, :, -1, :]

    return query_layer, key_layer, value_layer

With the original mixed_x_layer.shape[:-1] code, I encountered the following error:

[default0]:    query_layer = mixed_x_layer[:, :, :, :-2, :].reshape(mixed_x_layer.shape[:-1] + (-1, self.hidden_size_per_attention_head))
[default0]:RuntimeError: shape '[512, 1, 2, 3, -1, 4]' is invalid for input of size 4096
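
For illustration, here is a minimal standalone sketch of the mismatch, using the dimensions implied by the error message (sq=512, b=1, nkv=2, nq // nkv + 2 = 3, hn=4). The variable names here are my own for readability, not names from the Megatron-DeepSpeed source:

import torch

sq, b, nkv, hn = 512, 1, 2, 4
groups = 3  # nq // nkv + 2, so nq // nkv == 1 here
mixed_x_layer = torch.randn(sq, b, nkv, groups, hn)

# The query slice drops the last two groups: [512, 1, 2, 1, 4] = 4096 elements.
q = mixed_x_layer[:, :, :, :-2, :]

# shape[:-1] keeps the group dimension, so the target [512, 1, 2, 3, -1, 4]
# needs a multiple of 12288 elements and cannot match the 4096-element slice.
try:
    q.reshape(mixed_x_layer.shape[:-1] + (-1, hn))
except RuntimeError as e:
    print(e)  # shape '[512, 1, 2, 3, -1, 4]' is invalid for input of size 4096

# shape[:-2] drops the group dimension and the reshape succeeds,
# matching the fix proposed above:
print(q.reshape(mixed_x_layer.shape[:-2] + (-1, hn)).shape)  # torch.Size([512, 1, 2, 1, 4])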

I hit the same issue on the latest main branch and was unable to fix it. However, the code works when I switch to commit 2348eed (Nov 17, 2023).

I also encountered the same issue with the code on the latest main branch:

CheckpointFunction.apply(function, all_outputs, *args)
  File "/home/anaconda3/envs/zbx1/lib/python3.10/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 566, in forward
    outputs = run_function(*inputs_cuda)
  File "/home/zbx/Megatron-DeepSpeed/megatron/model/transformer.py", line 1729, in custom_forward
    output = layer(x_, *args, **kwargs)
  File "/home/anaconda3/envs/zbx1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/wangzhigangcs/zbx/Megatron-DeepSpeed/megatron/model/transformer.py", line 1222, in forward
    self.self_attention(
  File "/home/anaconda3/envs/zbx1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/zbx/Megatron-DeepSpeed/megatron/model/transformer.py", line 694, in forward
    value_layer) = self.split_tensor(mixed_x_layer)
  File "/home/zbx/Megatron-DeepSpeed/megatron/model/transformer.py", line 647, in split_tensor
    query_layer = mixed_x_layer[:, :, :, :-2, :].reshape(mixed_x_layer.shape[:-1] + (-1, self.hidden_size_per_attention_head))
RuntimeError: shape '[2048, 2, 2, 3, -1, 64]' is invalid for input of size 524288
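
The arithmetic in this traceback is consistent with the diagnosis above: the sliced query tensor holds 2048 · 2 · 2 · 1 · 64 = 524288 elements, while the fixed dimensions of the target shape multiply to 2048 · 2 · 2 · 3 · 64 = 1572864, three times the input size, so no integer value of the inferred -1 dimension can satisfy the reshape.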

Thank you all for your comments. I see that the related PR #307 was opened three days ago; it would be good to refer to that.