THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Home Page: https://THUDM.github.io/SwissArmyTransformer

Has anyone tried model parallelism?

XiaoqingNLP opened this issue · comments

  1. With mp-size > 1, the loss plateaus at around 10; with mp == 1, it steadily decreases to about 5.
  2. Running the finetune-sst2 script under the glm example produces the error below. Was this tested at the time?
  3. Is it true that the GLM model loaded here cannot directly use the checkpoints from the cloud drive provided in the GLM repo? [I tried loading one; an assert reported that the embedding weight was uninitialized, among other problems.]

../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [40,0,0], thread: [75,0,0] Assertion srcIndex < srcSelectDimSize failed.
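For context, this CUDA-side assert fires when an index handed to an embedding/index_select lookup is out of range for the table actually stored on that GPU. Under model parallelism the embedding rows are sharded across ranks, so one common cause is a mismatch between the tokenizer's vocabulary and the (padded, sharded) embedding size. A toy pure-Python sketch of the invariant the kernel checks (all sizes here are made up for illustration, not SAT's real configuration):

```python
# Hypothetical sizes, for illustration only.
vocab_size = 50048                   # full vocabulary
mp_size = 2                          # model-parallel world size
shard_rows = vocab_size // mp_size   # rows held by each rank: 25024

def index_ok(token_id, table_rows):
    """The srcIndex < srcSelectDimSize check the CUDA kernel asserts
    for every gathered index."""
    return 0 <= token_id < table_rows

print(index_ok(30000, vocab_size))   # True: valid against the full table
print(index_ok(30000, shard_rows))   # False: out of range on one shard
```

A token id that is valid for the full vocabulary can still exceed the rows a single rank stores, which is why such failures often appear only once mp > 1.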

  1. Which model are you experimenting with, and on what dataset?
  2. At which step does the error occur? Could you provide a more complete error message?
  3. No. You can download the checkpoints from the paths listed in SwissArmyTransformer/resources/urls.py, or use AutoModel.from_pretrained.

@yzy-thu Thanks for your reply.

  1. It is a model I developed on top of this framework; the data is a generic monolingual news corpus. It converges normally with mp=1, so I want to figure out why mp=2 does not converge. That is why I tried the official examples to check whether they converge, which led to the questions below.
  2. The error occurred while I was running finetune-sst2 to verify that model parallelism works. The failure site is:

  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/model/transformer.py", line 330, in forward
    return HOOKS_DEFAULT['layer_forward'](self, hidden_states, mask, *args, **kw_args)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/transformer_defaults.py", line 161, in layer_forward_default
    mlp_output = self.mlp(mlp_input, **kw_args)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/model/transformer.py", line 225, in forward
    output = HOOKS_DEFAULT['mlp_forward'](self, hidden_states, **kw_args)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/transformer_defaults.py", line 112, in mlp_forward_default
    intermediate_parallel = self.activation_func(intermediate_parallel)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/mpu/utils.py", line 98, in gelu
    return gelu_impl(x)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA driver error: device-side assert triggered
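A general PyTorch debugging note (not SAT-specific): device-side asserts surface asynchronously, so the Python traceback often blames a later op (here, gelu) rather than the kernel that actually failed (likely the index_select from the embedding lookup shown in the first report). Setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the traceback points at the real failing call:

```python
# Must be set before the first CUDA call (at the top of the script, or
# exported in the launching shell). With it, the stack trace lands on the
# kernel that actually triggered the device-side assert.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

This slows execution considerably, so it is only for debugging runs.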

  1. Were the models listed in SwissArmyTransformer/resources/urls.py trained with this framework, or trained with another library and then converted?
  2. Most of the examples use mp=1. Is there any tested example with mp>1?

I also added a new custom model locally and ran into a similar problem: RuntimeError: CUDA driver error: device-side assert triggered. Could you share how you solved it?

Examples with mp>1 were recently added; see https://github.com/THUDM/SwissArmyTransformer/blob/main/examples/llama/split_model.py
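For intuition, re-sharding an mp=1 checkpoint to mp>1 means splitting each column-parallel weight along its output dimension (and each row-parallel weight along its input dimension) into one shard per rank. A toy pure-Python sketch of the column split (the real split_model.py operates on PyTorch tensors; this only shows the slicing logic):

```python
def split_columns(weight, mp_size):
    """Split a rows-by-cols matrix column-wise into mp_size equal shards,
    one per model-parallel rank (a toy stand-in for chunking a tensor
    along its last dimension)."""
    cols = len(weight[0])
    assert cols % mp_size == 0, "width must divide evenly across ranks"
    shard = cols // mp_size
    return [[row[r * shard:(r + 1) * shard] for row in weight]
            for r in range(mp_size)]

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                 # toy 2x4 weight
rank0, rank1 = split_columns(w, 2)
# rank0 == [[1, 2], [5, 6]], rank1 == [[3, 4], [7, 8]]
```

The divisibility assert is also why hidden sizes and vocabulary sizes are typically padded to multiples of the model-parallel world size.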

The use case is this: the model itself was trained with mp=1, but we want to train with mp>1 the next time. Simply calling from_pretrained(xxx, overwrite_args={'model_parallel_size': 2}) lets you continue training under mp=2.