THUDM / GLM-130B

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

Error when running on V100 (8 × 32 GB)

yihuaxiang opened this issue · comments

After downloading and extracting the model, I modified the configuration and ran bash scripts/generate.sh --input-source interactive, which failed with an error.

The configuration changes are as follows:

(py39) [root@iZbp1219pbxs72sxk8onovZ GLM-130B]# git diff
diff --git a/configs/model_glm_130b_v100.sh b/configs/model_glm_130b_v100.sh
index 0b33485..1a474a8 100644
--- a/configs/model_glm_130b_v100.sh
+++ b/configs/model_glm_130b_v100.sh
@@ -1,5 +1,5 @@
MODEL_TYPE="glm-130b"
-CHECKPOINT_PATH=""
+CHECKPOINT_PATH="/root/130b/glm-130b-sat"
MP_SIZE=8
MODEL_ARGS="--model-parallel-size ${MP_SIZE}
--num-layers 70
diff --git a/scripts/generate.sh b/scripts/generate.sh
index 19bef0a..4732652 100644
--- a/scripts/generate.sh
+++ b/scripts/generate.sh
@@ -4,7 +4,7 @@ script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)

-source "${main_dir}/configs/model_glm_130b.sh"
+source "${main_dir}/configs/model_glm_130b_v100.sh"

SEED=1234
MAX_OUTPUT_LENGTH=256

Only two files were modified:

  1. configs/model_glm_130b_v100.sh — changed CHECKPOINT_PATH
  2. scripts/generate.sh — changed model_glm_130b.sh to model_glm_130b_v100.sh; nothing else was changed

The error output is as follows:

Core error message:

Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Full error output:

(py39) [root@iZbp1219pbxs72sxk8onovZ GLM-130B]# bash scripts/generate.sh --input-source interactive
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Setting ds_accelerator to cuda (auto detect)   [repeated ×8, once per worker process]
WARNING: No training data specified   [repeated ×8, once per worker process]
using world size: 8 and model-parallel size: 8

padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 7 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_07_model_states.pt
global rank 0 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_00_model_states.pt
global rank 4 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_04_model_states.pt
global rank 1 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_01_model_states.pt
global rank 6 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_06_model_states.pt
global rank 3 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_03_model_states.pt
global rank 2 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_02_model_states.pt
global rank 5 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_05_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_01_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_06_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_00_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_04_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_03_model_states.pt
/root/miniconda3/envs/py39/lib/python3.9/site-packages/bminf/scheduler/__init__.py:221: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
total_size += param.numel() * param.storage().element_size()
[warning repeated once per worker process]
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_02_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_07_model_states.pt
BMInf activated, memory limit: 25 GB
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_05_model_states.pt
Model initialized in 121.0s
/root/GLM-130B/generation/strategies.py:17: FutureWarning: In the future np.bool will be defined as the corresponding NumPy scalar.
self._is_done = np.zeros(self.batch_size, dtype=np.bool)
[warning repeated once per worker process]
[the tracebacks from the 8 worker processes are interleaved in the raw output; deduplicated, each reads:]

Traceback (most recent call last):
  File "/root/GLM-130B/generate.py", line 215, in <module>
    main(args)
  File "/root/GLM-130B/generate.py", line 165, in main
    strategy = BaseStrategy(
  File "/root/GLM-130B/generation/strategies.py", line 17, in __init__
    self._is_done = np.zeros(self.batch_size, dtype=np.bool)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(former_attrs[attr])
AttributeError: module 'numpy' has no attribute 'bool'.
np.bool was a deprecated alias for the builtin bool. To avoid this error in existing code, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52609 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52610 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52611 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52612 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52613 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52614 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52615 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 52608) of binary: /root/miniconda3/envs/py39/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/GLM-130B/generate.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-06-06_23:21:56
host : iZbp1219pbxs72sxk8onovZ
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 52608)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Sengxian do you know what's going on here? Any pointers would be appreciated 🙏

I've tracked down the problem: it's the NumPy version. Installing 1.20.3 fixes it.
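As the NumPy error message itself suggests, an alternative to downgrading is to patch generation/strategies.py to use the builtin bool (or np.bool_) instead of the removed np.bool alias. A minimal sketch of the compatible call (batch_size is an illustrative stand-in for self.batch_size in strategies.py):

```python
import numpy as np

# np.bool was a deprecated alias removed in NumPy 1.24; the builtin bool
# (or np.bool_) produces the same boolean array on every NumPy version.
batch_size = 4  # illustrative value; in strategies.py this is self.batch_size
is_done = np.zeros(batch_size, dtype=bool)
print(is_done.dtype)  # prints: bool
```

Alternatively, pinning the version the poster reported working (pip install "numpy==1.20.3") avoids touching the code at all.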

@yihuaxiang Hi, could you share the weights?

@zhyj3038

Everything is at its default values; I didn't modify anything. The error was caused by a newer numpy being installed; after installing 1.20.3 the error went away.

@yihuaxiang I ran into this same problem and fixed it. I don't have the weight files, so I randomly initialized the model myself and then hit other errors, e.g. IndexError: Out of range: piece id is out of range. That's why I'd like to download the weights and see whether that resolves it.


Oh, I see. I didn't do anything special with the weights either; I just cloned the code directly.

I don't have the weights. I emailed the authors but haven't heard back in a long time.

Wait — you just emailed the authors for the weights and then it ran? That doesn't seem likely.

Let me add you on WeChat to ask for advice. My WeChat ID is zhyj3038; please add me, thanks!

Has anyone tried converting the model to INT4 for inference?