mosaicml / examples

Fast and flexible reference benchmarks

MosaicML LLM: 'key_padding_mask' is NoneType when setting "attn_impl: torch"

howiejayz opened this issue · comments

Hi Team,

Everything worked fine with "attn_impl: flash". But when I tried to train the LLM models without FlashAttention by setting "attn_impl: torch" in the YAMLs, the following error occurred.

Traceback (most recent call last):
File "/workspace/examples_latest/examples/llm/main.py", line 215, in
main(cfg)
File "/workspace/examples_latest/examples/llm/main.py", line 204, in main
trainer.fit()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1787, in fit
self._train_loop()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1950, in _train_loop
total_loss_dict = self._train_batch(use_grad_scaling)
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2126, in _train_batch
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/composer/optim/decoupled_weight_decay.py", line 289, in step
loss = closure()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2126, in
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2209, in _train_microbatches
microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2255, in _train_microbatch
self.state.outputs = self.state.model(self.state.batch)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/mosaic_gpt.py", line 255, in forward
return self.model(input_ids=input_ids,
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2727, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 165, in forward
return self.module(*inputs, **kwinputs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/mosaic_gpt.py", line 158, in forward
x = block(x, mod_key_padding_mask, attn_mask)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2727, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 165, in forward
return self.module(*inputs, **kwinputs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/layers/gpt_blocks.py", line 53, in forward
b, _ = self.causal_attn(a, key_padding_mask, attn_mask)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/layers/attention.py", line 39, in forward
key_padding_mask=~key_padding_mask,
TypeError: bad operand type for unary ~: 'NoneType'
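
The failure itself is easy to reproduce outside the repo: when no padding mask is built, key_padding_mask is None by the time it reaches the torch attention path, and the bitwise invert has nothing to operate on. A minimal sketch of just that expression (not repo code):

key_padding_mask = None          # what the attention layer receives when no mask is supplied
inverted = ~key_padding_mask     # TypeError: bad operand type for unary ~: 'NoneType'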

Thanks for noting the issue.
Fix in progress: #211
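
For reference, the shape of the fix is a None guard before the invert; nn.MultiheadAttention already accepts key_padding_mask=None. A minimal, self-contained sketch with illustrative names (not the exact diff in #211):

import torch
import torch.nn as nn

# Invert the padding mask only when one was provided; pass None through untouched.
# (The invert exists because the model-side mask marks valid tokens with True,
# while nn.MultiheadAttention expects True at positions to ignore.)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 8, 64)                 # (batch, seq, d_model)
key_padding_mask = None                   # no padding info, as in this report

kpm = ~key_padding_mask if key_padding_mask is not None else None
out, _ = mha(x, x, x, key_padding_mask=kpm, attn_mask=None, need_weights=False)
print(out.shape)                          # torch.Size([2, 8, 64])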

#211 merged. Should work now.