THUDM / GLM-130B

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

Error when running on V100 (8 × 32 GB)

yihuaxiang opened this issue · comments

After downloading and extracting the model, I modified the configuration and ran bash scripts/generate.sh --input-source interactive, which failed with an error.

The configuration changes are as follows:

(py39) [root@iZbp1219pbxs72sxk8onovZ GLM-130B]# git diff
diff --git a/configs/model_glm_130b_v100.sh b/configs/model_glm_130b_v100.sh
index 0b33485..1a474a8 100644
--- a/configs/model_glm_130b_v100.sh
+++ b/configs/model_glm_130b_v100.sh
@@ -1,5 +1,5 @@
MODEL_TYPE="glm-130b"
-CHECKPOINT_PATH=""
+CHECKPOINT_PATH="/root/130b/glm-130b-sat"
MP_SIZE=8
MODEL_ARGS="--model-parallel-size ${MP_SIZE}
--num-layers 70
diff --git a/scripts/generate.sh b/scripts/generate.sh
index 19bef0a..4732652 100644
--- a/scripts/generate.sh
+++ b/scripts/generate.sh
@@ -4,7 +4,7 @@ script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)

-source "${main_dir}/configs/model_glm_130b.sh"
+source "${main_dir}/configs/model_glm_130b_v100.sh"

SEED=1234
MAX_OUTPUT_LENGTH=256

Only two files were modified:

  1. configs/model_glm_130b_v100.sh — changed CHECKPOINT_PATH
  2. scripts/generate.sh — changed model_glm_130b.sh to model_glm_130b_v100.sh; nothing else was changed

The error output is as follows:

Core error message:

Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Full error output:

(py39) [root@iZbp1219pbxs72sxk8onovZ GLM-130B]# bash scripts/generate.sh --input-source interactive
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Setting ds_accelerator to cuda (auto detect)   [repeated ×8, once per worker process]
WARNING: No training data specified   [repeated ×8, once per worker process]
using world size: 8 and model-parallel size: 8

padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 7 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_07_model_states.pt
global rank 0 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_00_model_states.pt
global rank 4 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_04_model_states.pt
global rank 1 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_01_model_states.pt
global rank 6 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_06_model_states.pt
global rank 3 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_03_model_states.pt
global rank 2 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_02_model_states.pt
global rank 5 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_05_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_01_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_06_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_00_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_04_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_03_model_states.pt
/root/miniconda3/envs/py39/lib/python3.9/site-packages/bminf/scheduler/__init__.py:221: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
total_size += param.numel() * param.storage().element_size()
[warning repeated once per worker process]
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_02_model_states.pt
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_07_model_states.pt
BMInf activated, memory limit: 25 GB
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_05_model_states.pt
Model initialized in 121.0s
/root/GLM-130B/generation/strategies.py:17: FutureWarning: In the future np.bool will be defined as the corresponding NumPy scalar.
self._is_done = np.zeros(self.batch_size, dtype=np.bool)
[warning repeated once per worker process]
[the tracebacks from the 8 worker processes are interleaved in the raw output; deduplicated, each reads:]

Traceback (most recent call last):
  File "/root/GLM-130B/generate.py", line 215, in <module>
    main(args)
  File "/root/GLM-130B/generate.py", line 165, in main
    strategy = BaseStrategy(
  File "/root/GLM-130B/generation/strategies.py", line 17, in __init__
    self._is_done = np.zeros(self.batch_size, dtype=np.bool)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(former_attrs[attr])
AttributeError: module 'numpy' has no attribute 'bool'.
np.bool was a deprecated alias for the builtin bool. To avoid this error in existing code, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52609 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52610 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52611 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52612 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52613 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52614 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52615 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 52608) of binary: /root/miniconda3/envs/py39/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/GLM-130B/generate.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-06-06_23:21:56
host : iZbp1219pbxs72sxk8onovZ
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 52608)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Sengxian do you know what's going on here? Any pointers would be appreciated 🙏

I've tracked down the problem: it's the NumPy version. Installing 1.20.3 fixes it.
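As the NumPy error message itself suggests, an alternative to downgrading is to patch generation/strategies.py to use the builtin bool (or np.bool_) instead of the removed np.bool alias. A minimal sketch of the compatible call (batch_size is an illustrative stand-in for self.batch_size in strategies.py):

```python
import numpy as np

# np.bool was a deprecated alias removed in NumPy 1.24; the builtin bool
# (or np.bool_) produces the same boolean array on every NumPy version.
batch_size = 4  # illustrative value; in strategies.py this is self.batch_size
is_done = np.zeros(batch_size, dtype=bool)
print(is_done.dtype)  # prints: bool
```

Alternatively, pinning the version the poster reported working (pip install "numpy==1.20.3") avoids touching the code at all.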

@yihuaxiang Hi, could you share the weights?

@zhyj3038

Everything is at its default values; I didn't modify anything. The error was caused by a newer numpy being installed; after installing 1.20.3 the error went away.

@yihuaxiang I ran into this same problem and fixed it. I don't have the weight files, so I randomly initialized the model myself and then hit other errors, e.g. IndexError: Out of range: piece id is out of range. That's why I'd like to download the weights and see whether that resolves it.


Oh, I see. I didn't do anything special with the weights either; I just cloned the code directly.

I don't have the weights. I emailed the authors but haven't heard back in a long time.

Wait — you just emailed the authors for the weights and then it ran? That doesn't seem likely.

Let me add you on WeChat to ask for advice. My WeChat ID is zhyj3038; please add me, thanks!

Has anyone tried converting the model to INT4 for inference?