THUDM / GLM

GLM (General Language Model)

Error when using ZeRO-1 with cpu_offload=true?

SkrDrag opened this issue

The script being run:

bash scripts/ds_finetune_superglue.sh \
    config_tasks/model_blocklm_2B.sh \
    config_tasks/task_copa.sh
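For context, "ZeRO-1 + cpu_offload=true" here refers to the zero_optimization section of the DeepSpeed config passed to the launcher (config_tasks/config_blocklm_10B.json). A minimal sketch of the assumed settings for the failing run, written as a Python dict (illustrative only; the actual JSON file contains additional fields):

# Assumed shape of the DeepSpeed config for this run (hypothetical values;
# only the zero_optimization settings matter for this issue).
ds_config = {
    "fp16": {"enabled": True},       # matches the --fp16 launch flag
    "zero_optimization": {
        "stage": 1,                  # ZeRO-1: partition optimizer states only
        "cpu_offload": True,         # offload optimizer state/updates to CPU (DeepSpeedCPUAdam)
    },
}

The full launcher output follows.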

[2023-07-22 01:03:41,666] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-22 01:03:48,454] [INFO] [runner.py:358:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=61318 finetune_glm.py --deepspeed --deepspeed_config config_tasks/config_blocklm_10B.json --finetune --cloze-eval --experiment-name blocklm-2B-copa_07-22-01-03 --task COPA --data-dir /home/llw/workspace/dataset/COPA --save /home/llw/workspace/checkpoints --seq-length 256 --checkpoint-activations --eval-batch-size 16 --save-epoch 100000 --num-workers 1 --no-load-optim --no-load-lr-scheduler --block-lm --cloze-eval --task-mask --num-layers 36 --hidden-size 2048 --num-attention-heads 32 --max-position-embeddings 1024 --tokenizer-type GPT2BPETokenizer --load-pretrained /home/llw/workspace/checkpoints/blocklm-2b-512 --lr-decay-style linear --warmup 0.1 --weight-decay 1.0e-1 --pattern-id 0 --save-interval 10000 --log-interval 20 --eval-interval 1000 --eval-iters 100 --pattern-id 0 --fp16 --model-parallel-size 1 --epochs 100 --overwrite
[2023-07-22 01:03:49,366] [INFO] [launch.py:73:main] 0 NCCL_IB_DISABLE 0
[2023-07-22 01:03:49,367] [INFO] [launch.py:73:main] 0 NCCL_DEBUG info
[2023-07-22 01:03:49,367] [INFO] [launch.py:73:main] 0 NCCL_NET_GDR_LEVEL 2
[2023-07-22 01:03:49,367] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-07-22 01:03:49,367] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-07-22 01:03:49,367] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-07-22 01:03:49,367] [INFO] [launch.py:102:main] dist_world_size=4
[2023-07-22 01:03:49,367] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-07-22 01:03:50,965] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2023-07-22 01:03:50,982] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2023-07-22 01:03:50,991] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
using world size: 4 and model-parallel size: 1

using dynamic loss scaling
[2023-07-22 01:03:50,999] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
padded vocab (size: 50265) with 39 dummy tokens (new size: 50304)
found end-of-document token: 50256
big-node0:864635:864635 [3] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864635:864635 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864635:864635 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864635:864635 [3] NCCL INFO NET/IB : No device found.
big-node0:864635:864635 [3] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864635:864635 [3] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864635:864924 [3] NCCL INFO Channel 00/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 01/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 02/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 03/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 04/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 05/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 06/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 07/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 08/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 09/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 10/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 11/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 12/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 13/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 14/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 15/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 16/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 17/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 18/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 19/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 20/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 21/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 22/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 23/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 24/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 25/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 26/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 27/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 28/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 29/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 30/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 31/32 : 0
big-node0:864635:864924 [3] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864635:864924 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
big-node0:864635:864924 [3] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864635:864924 [3] NCCL INFO comm 0x7f7c5c002e10 rank 0 nranks 1 cudaDev 3 busId 57000 - Init COMPLETE
big-node0:864634:864634 [2] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864634:864634 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864634:864634 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864634:864634 [2] NCCL INFO NET/IB : No device found.
big-node0:864634:864634 [2] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864634:864634 [2] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864634:864931 [2] NCCL INFO Channel 00/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 01/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 02/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 03/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 04/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 05/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 06/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 07/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 08/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 09/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 10/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 11/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 12/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 13/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 14/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 15/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 16/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 17/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 18/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 19/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 20/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 21/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 22/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 23/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 24/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 25/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 26/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 27/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 28/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 29/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 30/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 31/32 : 0
big-node0:864634:864931 [2] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864634:864931 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
big-node0:864634:864931 [2] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864634:864931 [2] NCCL INFO comm 0x7f7784002e10 rank 0 nranks 1 cudaDev 2 busId 56000 - Init COMPLETE
big-node0:864632:864632 [0] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864632:864632 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864632:864632 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864632:864632 [0] NCCL INFO NET/IB : No device found.
big-node0:864632:864632 [0] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864632:864632 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864633:864633 [1] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864633:864633 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864633:864633 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864633:864633 [1] NCCL INFO NET/IB : No device found.
big-node0:864633:864633 [1] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864633:864633 [1] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864633:864936 [1] NCCL INFO Channel 00/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 01/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 02/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 03/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 04/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 05/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 06/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 07/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 08/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 09/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 10/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 11/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 12/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 13/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 14/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 15/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 16/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 17/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 18/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 19/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 20/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 21/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 22/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 23/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 24/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 25/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 26/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 27/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 28/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 29/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 30/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 31/32 : 0
big-node0:864633:864936 [1] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864633:864936 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
big-node0:864633:864936 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864633:864936 [1] NCCL INFO comm 0x7f9094002e10 rank 0 nranks 1 cudaDev 1 busId 52000 - Init COMPLETE
big-node0:864632:864934 [0] NCCL INFO Channel 00/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 01/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 02/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 03/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 04/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 05/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 06/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 07/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 08/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 09/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 10/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 11/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 12/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 13/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 14/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 15/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 16/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 17/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 18/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 19/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 20/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 21/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 22/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 23/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 24/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 25/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 26/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 27/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 28/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 29/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 30/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 31/32 : 0
big-node0:864632:864934 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864632:864934 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
big-node0:864632:864934 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864632:864934 [0] NCCL INFO comm 0x7f130c002e10 rank 0 nranks 1 cudaDev 0 busId 4f000 - Init COMPLETE
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=train)
Added 400 mirror examples, total size is 800...
Returning 800 train examples with label dist.: [(0, 400), (1, 400)]
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=dev)
Returning 100 dev examples with label dist.: [(1, 45), (0, 55)]
building train and validation dataloaders ...
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=dev)
Returning 100 dev examples with label dist.: [(1, 45), (0, 55)]
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=test)
Returning 500 test examples with label dist.: [(None, 500)]
building GLM model ...
number of parameters on model parallel rank 0: 1920122880
DeepSpeed is enabled.
[2023-07-22 01:04:11,745] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
big-node0:864633:865177 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864632:865174 [0] NCCL INFO Channel 00/02 : 0 1 2 3
big-node0:864635:865175 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864633:865177 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
big-node0:864632:865174 [0] NCCL INFO Channel 01/02 : 0 1 2 3
big-node0:864635:865175 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
big-node0:864635:865175 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
big-node0:864633:865177 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
big-node0:864634:865176 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864634:865176 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
big-node0:864634:865176 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
big-node0:864632:865174 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864632:865174 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
big-node0:864632:865174 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 3(=57000)
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Channel 00 : 1[52000] -> 2[56000] via direct shared memory
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 0(=4f000)
big-node0:864635:865175 [3] NCCL INFO Channel 00 : 3[57000] -> 0[4f000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864632:865174 [0] NCCL INFO Channel 00 : 0[4f000] -> 1[52000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864634:865176 [2] NCCL INFO Channel 00 : 2[56000] -> 3[57000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864635:865175 [3] NCCL INFO Channel 00 : 3[57000] -> 2[56000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864633:865177 [1] NCCL INFO Channel 00 : 1[52000] -> 0[4f000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864634:865176 [2] NCCL INFO Channel 00 : 2[56000] -> 1[52000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 3(=57000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864632:865174 [0] NCCL INFO Channel 01 : 0[4f000] -> 1[52000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Channel 01 : 1[52000] -> 2[56000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864634:865176 [2] NCCL INFO Channel 01 : 2[56000] -> 3[57000] via direct shared memory
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 0(=4f000)
big-node0:864635:865175 [3] NCCL INFO Channel 01 : 3[57000] -> 0[4f000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864635:865175 [3] NCCL INFO Channel 01 : 3[57000] -> 2[56000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864633:865177 [1] NCCL INFO Channel 01 : 1[52000] -> 0[4f000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864632:865174 [0] NCCL INFO comm 0x7f11f0002e10 rank 0 nranks 4 cudaDev 0 busId 4f000 - Init COMPLETE
big-node0:864632:864632 [0] NCCL INFO Launch mode Parallel
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864634:865176 [2] NCCL INFO Channel 01 : 2[56000] -> 1[52000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864633:865177 [1] NCCL INFO comm 0x7f8f80002e10 rank 1 nranks 4 cudaDev 1 busId 52000 - Init COMPLETE
big-node0:864635:865175 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864634:865176 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864635:865175 [3] NCCL INFO comm 0x7f79bc002e10 rank 3 nranks 4 cudaDev 3 busId 57000 - Init COMPLETE
big-node0:864634:865176 [2] NCCL INFO comm 0x7f7670002e10 rank 2 nranks 4 cudaDev 2 busId 56000 - Init COMPLETE
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/llw/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6447687149047852 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.5554640293121338 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6554622650146484 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7334985733032227 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,175] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,187] [INFO] [engine.py:600:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-07-22 01:04:17,187] [INFO] [engine.py:605:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 1 optimizer
[2023-07-22 01:04:17,187] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] Updating max_elements_per_comm from 50000000.0 -> 62626055.0
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 1919160320, max elements per com: 62626055.0
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 31, sub_partition_size: 15656513, padding: 22247292
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 1919160320 + 22247292 = 1941407612
[2023-07-22 01:04:17,193] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2023-07-22 01:04:17,193] [INFO] [stage1.py:368:get_data_parallel_sub_partitions] total_num_elements=1941407612
[2023-07-22 01:04:17,193] [INFO] [stage1.py:369:get_data_parallel_sub_partitions] world_size=4
[2023-07-22 01:04:17,193] [INFO] [stage1.py:370:get_data_parallel_sub_partitions] max_elements_per_comm=62626055.0
[2023-07-22 01:04:17,193] [INFO] [stage1.py:371:get_data_parallel_sub_partitions] sub_partition_size=15656513
[2023-07-22 01:04:17,193] [INFO] [stage1.py:372:get_data_parallel_sub_partitions] num_sub_partitions=124
[2023-07-22 01:04:17,193] [INFO] [stage1.py:373:get_data_parallel_sub_partitions] num_comm_intervals=31
[2023-07-22 01:04:17,193] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,205] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2023-07-22 01:04:17,210] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 50000000.0
[2023-07-22 01:04:17,211] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 962560, max elements per com: 50000000.0
[2023-07-22 01:04:17,211] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 240640, padding: 0
[2023-07-22 01:04:17,211] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 962560 + 0 = 962560
[2023-07-22 01:04:17,215] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2023-07-22 01:04:17,215] [INFO] [stage1.py:368:get_data_parallel_sub_partitions] total_num_elements=962560
[2023-07-22 01:04:17,215] [INFO] [stage1.py:369:get_data_parallel_sub_partitions] world_size=4
[2023-07-22 01:04:17,215] [INFO] [stage1.py:370:get_data_parallel_sub_partitions] max_elements_per_comm=962560
[2023-07-22 01:04:17,215] [INFO] [stage1.py:371:get_data_parallel_sub_partitions] sub_partition_size=240640
[2023-07-22 01:04:17,215] [INFO] [stage1.py:372:get_data_parallel_sub_partitions] num_sub_partitions=4
[2023-07-22 01:04:17,215] [INFO] [stage1.py:373:get_data_parallel_sub_partitions] num_comm_intervals=1
[2023-07-22 01:04:17,215] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,236] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
Killing subprocess 864632
Killing subprocess 864633
Killing subprocess 864634
Killing subprocess 864635
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'finetune_glm.py', '--local_rank=3', '--deepspeed', '--deepspeed_config', 'config_tasks/config_blocklm_10B.json', '--finetune', '--cloze-eval', '--experiment-name', 'blocklm-2B-copa_07-22-01-03', '--task', 'COPA', '--data-dir', '/home/llw/workspace/dataset/COPA', '--save', '/home/llw/workspace/checkpoints', '--seq-length', '256', '--checkpoint-activations', '--eval-batch-size', '16', '--save-epoch', '100000', '--num-workers', '1', '--no-load-optim', '--no-load-lr-scheduler', '--block-lm', '--cloze-eval', '--task-mask', '--num-layers', '36', '--hidden-size', '2048', '--num-attention-heads', '32', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--load-pretrained', '/home/llw/workspace/checkpoints/blocklm-2b-512', '--lr-decay-style', 'linear', '--warmup', '0.1', '--weight-decay', '1.0e-1', '--pattern-id', '0', '--save-interval', '10000', '--log-interval', '20', '--eval-interval', '1000', '--eval-iters', '100', '--pattern-id', '0', '--fp16', '--model-parallel-size', '1', '--epochs', '100', '--overwrite']' died with <Signals.SIGSEGV: 11>.

Using ZeRO-2 + cpu_offload=true does not trigger this error and training runs normally.
Could someone explain why this happens?
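For comparison, a minimal sketch of the assumed zero_optimization settings for the working run; per the report above, only the ZeRO stage differs:

# Assumed zero_optimization section for the working run (hypothetical values).
zero_optimization = {
    "stage": 2,            # ZeRO-2: partition optimizer states and gradients
    "cpu_offload": True,   # same CPU offload setting as the failing ZeRO-1 run
}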