Has anyone tried model parallelism?
XiaoqingNLP opened this issue · comments
- Currently, with mp-size > 1 the loss plateaus around 10; with mp == 1 it steadily drops to about 5.
- Running the finetune-sst2 script under glm gives the error below. Was this tested at the time?
- Can the model loaded here not directly use the checkpoints provided on the cloud drive in the GLM repo? [I tried loading one; an assert reported that the embedding weight was uninitialized, among other problems.]
```
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [40,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
```
- Which model did you run this with, and on which dataset?
- At which step does the error occur? Could you provide a more complete error message?
- No. You can download from the paths in SwissArmyTransformer/resources/urls.py, or use AutoModel.from_pretrained.
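As an aside on getting a more complete error message: CUDA kernels run asynchronously, so a device-side assert usually surfaces with a misleading Python stack trace. Re-running synchronously pinpoints the failing op. The script path below is a placeholder for however you launch the finetune-sst2 example:

```shell
# Force synchronous kernel launches so the traceback points at the
# operation that actually triggered the device-side assert.
CUDA_LAUNCH_BLOCKING=1 bash scripts/finetune_sst2.sh
```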
@yzy-thu Thanks for the reply.
- This is a model I built on top of this framework; the data is a generic monolingual news corpus. It converges normally with mp=1, so I want to find out why mp=2 does not converge. That is why I tried the official example to see whether it converges, which led to the questions below:
- The error came from running finetune-sst2 to verify that model parallelism works. The location is as follows:
```
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/model/transformer.py", line 330, in forward
    return HOOKS_DEFAULT['layer_forward'](self, hidden_states, mask, *args, **kw_args)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/transformer_defaults.py", line 161, in layer_forward_default
    mlp_output = self.mlp(mlp_input, **kw_args)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/model/transformer.py", line 225, in forward
    output = HOOKS_DEFAULT['mlp_forward'](self, hidden_states, **kw_args)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/transformer_defaults.py", line 112, in mlp_forward_default
    intermediate_parallel = self.activation_func(intermediate_parallel)
  File "~/anaconda3/envs/sat/lib/python3.9/site-packages/SwissArmyTransformer/mpu/utils.py", line 98, in gelu
    return gelu_impl(x)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA driver error: device-side assert triggered
```
- Were the models in SwissArmyTransformer/resources/urls.py trained with this framework, or trained with another library and then converted?
- Most of the examples use mp=1. Have any mp>1 examples been tested?
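For context on why this particular assert tends to appear only when mp > 1: in a Megatron-style vocab-parallel embedding (which SwissArmyTransformer's mpu follows in spirit), each rank stores only a slice of the embedding table, and token ids outside that slice must be masked before the gather; otherwise the CUDA kernel fails with exactly the `srcIndex < srcSelectDimSize` assert above. A hypothetical, stdlib-only sketch of the masking idea (not sat's actual code):

```python
# Hypothetical sketch of a vocab-parallel embedding lookup under mp > 1.
# Each rank owns a contiguous slice of the vocabulary; out-of-range ids
# are masked locally and resolved by summing partial results across ranks
# (emulating the all-reduce).

def lookup_on_rank(token_ids, vocab_size, mp_size, rank, local_table):
    """Return this rank's partial embeddings; masked ids contribute 0.0."""
    per_rank = vocab_size // mp_size
    start, end = rank * per_rank, (rank + 1) * per_rank
    partial = []
    for t in token_ids:
        if start <= t < end:
            partial.append(local_table[t - start])  # id owned by this rank
        else:
            partial.append(0.0)                     # masked out locally
    return partial

# Toy setup: a vocab of 8 split across mp=2 ranks, with 1-d "embeddings"
# equal to the token id so the result is easy to check.
vocab_size, mp_size = 8, 2
tables = [[float(i) for i in range(4)], [float(i) for i in range(4, 8)]]
ids = [1, 6, 3]
partials = [lookup_on_rank(ids, vocab_size, mp_size, r, tables[r])
            for r in range(mp_size)]
# Summing the partials stands in for the all-reduce.
merged = [a + b for a, b in zip(partials[0], partials[1])]
print(merged)  # [1.0, 6.0, 3.0]
```

If the masking step is skipped (or a checkpoint's embedding was split inconsistently with the tokenizer's vocab size), rank-local lookups receive indices past the end of their slice, which on GPU is the device-side indexSelect assert.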
I also added a new custom model locally and hit a similar problem: RuntimeError: CUDA driver error: device-side assert triggered. Could you share how you resolved it?
An mp>1 example was added recently; see https://github.com/THUDM/SwissArmyTransformer/blob/main/examples/llama/split_model.py
The scenario is that the model itself was trained with mp=1, but we want to retrain it with mp>1. Simply calling from_pretrained(xxx, overwrite_args={'model_parallel_size': 2}) lets you continue training under mp=2.
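Under the hood, repartitioning an mp=1 checkpoint into mp=2 shards comes down to slicing each parallel layer's partitioned dimension across ranks. A hypothetical, stdlib-only illustration of that slicing (the real logic lives in the split_model.py example linked above; the function name here is made up):

```python
# Hypothetical sketch of repartitioning an mp=1 weight into mp-size shards.
# Rows stand in for the partitioned dimension, e.g. the output dimension
# of a column-parallel linear layer.

def split_rows(weight, mp_size):
    """Slice a weight's partitioned dimension evenly across mp_size ranks."""
    rows = len(weight)
    assert rows % mp_size == 0, "partitioned dim must divide by mp_size"
    per_rank = rows // mp_size
    return [weight[r * per_rank:(r + 1) * per_rank] for r in range(mp_size)]

# A toy 4x2 weight saved under mp=1 becomes two 2x2 shards under mp=2.
w = [[1, 2], [3, 4], [5, 6], [7, 8]]
shards = split_rows(w, mp_size=2)
print(shards[0])  # [[1, 2], [3, 4]]  -> rank 0
print(shards[1])  # [[5, 6], [7, 8]]  -> rank 1
```

Row-parallel layers are split along the opposite (input) dimension instead; getting this pairing wrong for any layer is one plausible way an mp=2 run ends up with mismatched shards and fails to converge.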