THUDM / GLM

GLM (General Language Model)

Error when running bash scripts/generate_block.sh config_tasks/model_blocklm_10B_chinese.sh

XiaozhuLove opened this issue · comments

root@58aa0f98defc:/workspace# bash scripts/generate_block.sh config_tasks/model_blocklm_10B_chinese.sh
Generate Samples
WARNING: No training data specified
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2023-12-01 07:31:50,775] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
{'pad': 50000, 'eos': 50000, 'sep': 50001, 'ENC': 50002, 'MASK': 50003, 'unk': 50004, 'sop': 50006, 'eop': 50007, 'gMASK': 50007, 'sMASK': 50008}
padded vocab (size: 50009) with 39 dummy tokens (new size: 50048)
found end-of-document token: 50000
building GLM model ...
Killing subprocess 743
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'generate_samples.py', '--local_rank=0', '--DDP-impl', 'none', '--model-parallel-size', '1', '--block-lm', '--cloze-eval', '--task-mask', '--num-layers', '48', '--hidden-size', '4096', '--num-attention-heads', '64', '--max-position-embeddings', '1024', '--tokenizer-type', 'ChineseSPTokenizer', '--load-pretrained', '/home/whzhu_st/Model/glm-10b-chinese', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--seq-length', '512', '--temperature', '0.9', '--top-k', '40', '--top-p', '0']' died with <Signals.SIGKILL: 9>.
root@58aa0f98defc:/workspace#
What is causing this? Thanks.

I ran into this before as well. It is probably an OOM (out-of-memory) kill; check the output of dmesg.
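To make the dmesg check concrete, here is a minimal sketch that scans the kernel log for lines that usually accompany an OOM kill. The helper names (`find_oom_events`, `read_dmesg`) and the matched substrings are my own assumptions, not part of the GLM repo, and the exact kernel wording varies between versions. Inside a Docker container, dmesg may be restricted, in which case it has to be run on the host.

```python
import re
import subprocess

# Substrings the Linux kernel commonly logs around an OOM kill.
# Assumption: the wording differs between kernel versions, so this is a loose match.
OOM_PATTERN = re.compile(r"oom-killer|out of memory|killed process", re.IGNORECASE)


def find_oom_events(log_text):
    """Return the lines of log_text that look like OOM-killer messages."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def read_dmesg():
    """Return the kernel log as text, or an empty string if dmesg cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # dmesg is missing or not permitted (common inside containers).
        return ""
    return result.stdout


if __name__ == "__main__":
    events = find_oom_events(read_dmesg())
    if events:
        print("Possible OOM kills found:")
        for line in events:
            print(line)
    else:
        print("No OOM messages found (or dmesg was not readable).")
```

If a matching line mentions the PID from the log above (743), the process was most likely killed for exhausting host RAM or the container's memory limit. That is separate from GPU memory: running out of GPU memory normally surfaces as a CUDA out-of-memory error in Python rather than a SIGKILL.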