[BUG] Model gets stuck at trainer.train() and never starts training
limllzu opened this issue
Is there an existing issue / discussion for this?
- I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- I have searched FAQ
Current Behavior
The dataset loads without any problem, but the model stays stuck at trainer.train() in finetune.py.
Expected Behavior
No response
Steps To Reproduce
Data:
[
    {
        "id": "0",
        "image": "path/image/001.jpg",
        "conversations": [
            {
                "role": "user",
                "content": "<image>\nHow many desserts are on the white plate?"
            },
            {
                "role": "assistant",
                "content": "There are three desserts on the white plate."
            },
            {
                "role": "user",
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant",
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            },
            {
                "role": "user",
                "content": "What is the setting of the image?"
            },
            {
                "role": "assistant",
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    }
]
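For a quick sanity check of this file before training, a minimal sketch like the one below can help (the path "data.json" and the use of plain json/os are my assumptions, not part of the finetune script); it verifies that each record parses, carries the expected keys, starts with a user turn containing the <image> placeholder, and points at an image that exists:

import json
import os

# Sanity-check the fine-tuning data file (hypothetical path "data.json"):
# every record should have id/image/conversations, the first turn should be
# a user turn containing the "<image>" placeholder, and the image must exist.
with open("data.json", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    assert {"id", "image", "conversations"} <= rec.keys(), rec
    assert os.path.isfile(rec["image"]), f"missing image: {rec['image']}"
    first = rec["conversations"][0]
    assert first["role"] == "user" and "<image>" in first["content"], rec["id"]
print(f"{len(records)} record(s) look well-formed")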
Environment
Package environment:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 2.1.0 pypi_0 pypi
accelerate 0.30.1 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiofiles 23.2.1 pypi_0 pypi
altair 5.3.0 pypi_0 pypi
annotated-types 0.7.0 pypi_0 pypi
anyio 4.4.0 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
binutils_impl_linux-64 2.36.1 h193b22a_2 conda-forge
binutils_linux-64 2.36 hf3e587d_10 conda-forge
bitsandbytes-cuda114 0.26.0.post2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
blinker 1.8.2 pypi_0 pypi
blis 0.7.11 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ca-certificates 2024.6.2 hbcca054_0 conda-forge
cachetools 5.3.3 pypi_0 pypi
catalogue 2.0.10 pypi_0 pypi
certifi 2024.2.2 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cloudpathlib 0.16.0 pypi_0 pypi
cmake 3.25.0 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
confection 0.1.5 pypi_0 pypi
contourpy 1.2.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
cymem 2.0.8 pypi_0 pypi
deepspeed 0.14.3 pypi_0 pypi
editdistance 0.6.2 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
exceptiongroup 1.2.1 pypi_0 pypi
fairscale 0.4.0 pypi_0 pypi
fastapi 0.110.3 pypi_0 pypi
ffmpy 0.3.2 pypi_0 pypi
filelock 3.14.0 pypi_0 pypi
flask 3.0.3 pypi_0 pypi
fonttools 4.53.0 pypi_0 pypi
fsspec 2024.5.0 pypi_0 pypi
gcc_impl_linux-64 11.2.0 h82a94d6_16 conda-forge
gcc_linux-64 11.2.0 h39a9532_10 conda-forge
gpustat 1.1.1 pypi_0 pypi
gradio 4.26.0 pypi_0 pypi
gradio-client 0.15.1 pypi_0 pypi
grpcio 1.64.1 pypi_0 pypi
gxx_impl_linux-64 11.2.0 h82a94d6_16 conda-forge
gxx_linux-64 11.2.0 hacbe6df_10 conda-forge
h11 0.14.0 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
httpcore 1.0.5 pypi_0 pypi
httpx 0.27.0 pypi_0 pypi
huggingface-hub 0.23.2 pypi_0 pypi
idna 3.7 pypi_0 pypi
importlib-resources 6.4.0 pypi_0 pypi
install 1.3.5 pypi_0 pypi
itsdangerous 2.2.0 pypi_0 pypi
jinja2 3.1.4 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
jsonlines 4.0.0 pypi_0 pypi
jsonschema 4.22.0 pypi_0 pypi
jsonschema-specifications 2023.12.1 pypi_0 pypi
kernel-headers_linux-64 2.6.32 he073ed8_17 conda-forge
kiwisolver 1.4.5 pypi_0 pypi
langcodes 3.4.0 pypi_0 pypi
language-data 1.2.0 pypi_0 pypi
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libaio 0.9.3 pypi_0 pypi
libffi 3.4.4 h6a678d5_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-devel_linux-64 11.2.0 h0952999_16 conda-forge
libgcc-ng 13.2.0 h77fa898_7 conda-forge
libgomp 13.2.0 h77fa898_7 conda-forge
libsanitizer 11.2.0 he4da1e4_16 conda-forge
libstdcxx-devel_linux-64 11.2.0 h0952999_16 conda-forge
libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge
libuuid 1.41.5 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit 15.0.7 pypi_0 pypi
lxml 5.2.2 pypi_0 pypi
marisa-trie 1.1.1 pypi_0 pypi
markdown 3.6 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markdown2 2.4.10 pypi_0 pypi
markupsafe 2.1.5 pypi_0 pypi
matplotlib 3.7.4 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
more-itertools 10.1.0 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
murmurhash 1.0.10 pypi_0 pypi
ncurses 6.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx 3.3 pypi_0 pypi
ninja 1.10.0 pypi_0 pypi
ninja-base 1.10.2 hd09550d_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nltk 3.8.1 pypi_0 pypi
numpy 1.24.4 pypi_0 pypi
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-ml-py 12.535.161 pypi_0 pypi
nvidia-nccl-cu12 2.18.1 pypi_0 pypi
nvidia-nvjitlink-cu12 12.5.40 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
nvitop 1.3.2 pypi_0 pypi
opencv-python-headless 4.5.5.64 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 3.3.1 h4ab18f5_0 conda-forge
orjson 3.10.3 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.2.2 pypi_0 pypi
peft 0.11.1 pypi_0 pypi
pillow 10.1.0 pypi_0 pypi
pip 24.0 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
portalocker 2.8.2 pypi_0 pypi
preshed 3.0.9 pypi_0 pypi
protobuf 4.25.0 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pydantic 2.7.2 pypi_0 pypi
pydantic-core 2.18.3 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pygments 2.18.0 pypi_0 pypi
pynvml 11.5.0 pypi_0 pypi
pyparsing 3.1.2 pypi_0 pypi
pyproject 1.3.1 pypi_0 pypi
python 3.10.14 h955ad1f_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil 2.9.0.post0 pypi_0 pypi
python-multipart 0.0.9 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 8.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
referencing 0.35.1 pypi_0 pypi
regex 2024.5.15 pypi_0 pypi
requests 2.32.3 pypi_0 pypi
rich 13.7.1 pypi_0 pypi
rpds-py 0.18.1 pypi_0 pypi
ruff 0.4.7 pypi_0 pypi
sacrebleu 2.3.2 pypi_0 pypi
safetensors 0.4.3 pypi_0 pypi
seaborn 0.13.0 pypi_0 pypi
semantic-version 2.10.0 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 69.5.1 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shellingham 1.5.4 pypi_0 pypi
shortuuid 1.0.11 pypi_0 pypi
six 1.16.0 pypi_0 pypi
smart-open 6.4.0 pypi_0 pypi
sniffio 1.3.1 pypi_0 pypi
socksio 1.0.0 pypi_0 pypi
spacy 3.7.2 pypi_0 pypi
spacy-legacy 3.0.12 pypi_0 pypi
spacy-loggers 1.0.5 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
srsly 2.4.8 pypi_0 pypi
starlette 0.37.2 pypi_0 pypi
sympy 1.12.1 pypi_0 pypi
sysroot_linux-64 2.12 he073ed8_17 conda-forge
tabulate 0.9.0 pypi_0 pypi
tensorboard 2.16.2 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorboardx 1.8 pypi_0 pypi
termcolor 2.4.0 pypi_0 pypi
thinc 8.2.3 pypi_0 pypi
timm 0.9.10 pypi_0 pypi
tk 8.6.14 h39e8969_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers 0.19.1 pypi_0 pypi
tomlkit 0.12.0 pypi_0 pypi
toolz 0.12.1 pypi_0 pypi
torch 2.1.2+cu118 pypi_0 pypi
torchaudio 2.1.2+cu118 pypi_0 pypi
torchvision 0.16.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
transformers 4.41.2 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
typer 0.9.4 pypi_0 pypi
typing-extensions 4.8.0 pypi_0 pypi
tzdata 2024.1 pypi_0 pypi
urllib3 2.2.1 pypi_0 pypi
uvicorn 0.24.0.post1 pypi_0 pypi
wasabi 1.1.3 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
weasel 0.3.4 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
werkzeug 3.0.3 pypi_0 pypi
wheel 0.43.0 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz 5.4.6 h5eee18b_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
zlib 1.2.13 h5eee18b_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
Anything else?
Output:
prepare trainer
Training dataset length: 1
Validation dataset length: 1
<class 'trainer.CPMTrainer'>
trainer ok
Error messages:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Relevant code:
# Check dataset lengths
print(f"Training dataset length: {len(data_module['train_dataset'])}")
print(f"Validation dataset length: {len(data_module['eval_dataset'])}")
rank0_print("prepare trainer")
trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)
rank0_print(type(trainer))
rank0_print("trainer ok")
trainer.train()
trainer.save_state()
rank0_print("trainer success")
How large is your dataset?
The dataset contains only one sample, taken from the official demo, as follows:
[
    {
        "id": "0",
        "image": "path/image/image_0.jpg",
        "conversations": [
            {
                "role": "user",
                "content": "<image>\nHow many desserts are on the white plate?"
            },
            {
                "role": "assistant",
                "content": "There are three desserts on the white plate."
            },
            {
                "role": "user",
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant",
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            },
            {
                "role": "user",
                "content": "What is the setting of the image?"
            },
            {
                "role": "assistant",
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    }
]
My environment is as follows, you can use it as a reference:
requirements.txt
My Linux kernel version is 5.4.0; I wonder whether the problem is caused by your kernel being 3.10.0.
Thank you for the reply. It seems the cause was that NCCL was not installed, but after installing it a new problem appeared. Could you take a look? Thanks.
Error messages:
Traceback (most recent call last):
File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 708, in
train()
File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 690, in train
trainer.train()
File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
result = self._prepare_deepspeed(*args)
File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/envs/llm/lib/python3.10/site-packages/deepspeed/init.py", line 181, in initialize
engine = DeepSpeedEngine(args=args,
File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in init
self._configure_distributed_model(model)
File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
self._broadcast_model()
File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1069, in _broadcast_model
dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/envs/llm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/envs/llm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/envs/llm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1251, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Bootstrap : no socket interface found
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30898 closing signal SIGTERM
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30899 closing signal SIGTERM
[2024-06-14 16:05:19,307] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30900 closing signal SIGTERM
[2024-06-14 16:05:20,135] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 30897) of binary: /envs/llm/bin/python
Output:
prepare trainer
trainer ok
[2024-06-14 16:05:12,128] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens1f1
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ens1f1
gpu009:30897:30897 [0] bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
gpu009:30897:30897 [0] NCCL INFO init.cc:82 -> 3
gpu009:30897:30897 [0] NCCL INFO init.cc:101 -> 3
From the logs it seems I have not configured the network interface correctly, but when I check with ifconfig the ens1f1 interface does exist, and it can be pinged.
Could you please take a look? Thank you!
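For reference: at bootstrap, NCCL only binds NCCL_SOCKET_IFNAME to an interface that is up and has an IP address in the launching process's network namespace, so "no socket interface found" can appear even though ifconfig shows the device. A small sketch (the variable names are real NCCL settings; the values are guesses for this cluster) that makes NCCL explain its interface selection, set before the process group is created:

import os

# Make NCCL log why it accepts or rejects an interface. Values below are
# assumptions for this cluster; they must be set before torch.distributed
# initializes the NCCL communicator.
os.environ["NCCL_SOCKET_IFNAME"] = "ens1f1"    # interface reported by ifconfig
os.environ["NCCL_DEBUG"] = "INFO"              # print bootstrap decisions
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"   # extra detail on NET selection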
When I switch the network interface to ib0 it no longer raises an error, but judging from the NCCL logs it is still stuck and not training.
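One way to turn a silent hang like this into an actionable error (a sketch under the assumption of torch 2.1, where the flag is still named NCCL_ASYNC_ERROR_HANDLING; in torch >= 2.2 it is TORCH_NCCL_ASYNC_ERROR_HANDLING) is to enable async error handling and shorten the collective timeout, so a stuck broadcast aborts with a traceback instead of waiting out the default 30 minutes:

import datetime
import os
import torch.distributed as dist

# With async error handling on, any collective exceeding the timeout aborts
# the NCCL communicator and raises, identifying the stuck operation.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=2),  # default is 30 minutes
)

In the actual finetune script the process group is created by the launcher/Trainer, so the timeout would have to be passed through there; the point is only that the hang then surfaces as an error.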
Output:
prepare trainer
Training dataset length: 1
Validation dataset length: 1
trainer ok
[2024-06-14 16:59:42,697] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown
gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47867:47867 [0] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47867:47867 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47867:47867 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47868:47868 [1] NCCL INFO cudaDriverVersion 12000
gpu009:47869:47869 [2] NCCL INFO cudaDriverVersion 12000
gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47870:47870 [3] NCCL INFO cudaDriverVersion 12000
gpu009:47868:47868 [1] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47869:47869 [2] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47869:47869 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47868:47868 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47869:47869 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47868:47868 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47869:47869 [2] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc46e800000
gpu009:47868:47868 [1] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc390800000
gpu009:47870:47870 [3] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47870:47870 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47870:47870 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47870:47870 [3] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fbfc0800000
gpu009:47869:48590 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47869:48590 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47869:48590 [2] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47869:48590 [2] NCCL INFO Using network Socket
gpu009:47868:48591 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47868:48591 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47867:47867 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.6+cuda12.1
gpu009:47868:48591 [1] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47868:48591 [1] NCCL INFO Using network Socket
gpu009:47870:48592 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47867:47867 [0] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f2adc800000
gpu009:47870:48592 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47870:48592 [3] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47870:48592 [3] NCCL INFO Using network Socket
gpu009:47867:48593 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47867:48593 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47867:48593 [0] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47867:48593 [0] NCCL INFO Using network Socket
gpu009:47867:48593 [0] NCCL INFO comm 0x7ef5c880 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 50000 commId 0x361b5540a6088610 - Init START
gpu009:47870:48592 [3] NCCL INFO comm 0x68da1c00 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 9c000 commId 0x361b5540a6088610 - Init START
gpu009:47869:48590 [2] NCCL INFO comm 0x69374b40 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 57000 commId 0x361b5540a6088610 - Init START
gpu009:47868:48591 [1] NCCL INFO comm 0x68b107c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 53000 commId 0x361b5540a6088610 - Init START
gpu009:47870:48592 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47868:48591 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47869:48590 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47867:48593 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47870:48592 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47870:48592 [3] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47870:48592 [3] NCCL INFO CPU/0 (1/1/2)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47870:48592 [3] NCCL INFO CPU/1 (1/1/2)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47870:48592 [3] NCCL INFO ==========================================
gpu009:47870:48592 [3] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47868:48591 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47870:48592 [3] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47870:48592 [3] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47870:48592 [3] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47870:48592 [3] NCCL INFO Setting affinity for GPU 3 to 3ff00000,0000003f,f0000000
gpu009:47870:48592 [3] NCCL INFO NVLS multicast support is not available on dev 3
gpu009:47869:48590 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47867:48593 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47868:48591 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47868:48591 [1] NCCL INFO CPU/0 (1/1/2)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47868:48591 [1] NCCL INFO CPU/1 (1/1/2)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47869:48590 [2] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47867:48593 [0] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47869:48590 [2] NCCL INFO CPU/0 (1/1/2)
gpu009:47868:48591 [1] NCCL INFO ==========================================
gpu009:47867:48593 [0] NCCL INFO CPU/0 (1/1/2)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47868:48591 [1] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47868:48591 [1] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47868:48591 [1] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47868:48591 [1] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47870:48592 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47868:48591 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,00000000,0003ff00
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47868:48591 [1] NCCL INFO NVLS multicast support is not available on dev 1
gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47869:48590 [2] NCCL INFO CPU/1 (1/1/2)
gpu009:47867:48593 [0] NCCL INFO CPU/1 (1/1/2)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47870:48592 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47869:48590 [2] NCCL INFO ==========================================
gpu009:47867:48593 [0] NCCL INFO ==========================================
gpu009:47869:48590 [2] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47869:48590 [2] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47869:48590 [2] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47869:48590 [2] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47867:48593 [0] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47869:48590 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,00000000,0003ff00
gpu009:47867:48593 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,00000000,0003ff00
gpu009:47869:48590 [2] NCCL INFO NVLS multicast support is not available on dev 2
gpu009:47867:48593 [0] NCCL INFO NVLS multicast support is not available on dev 0
gpu009:47868:48591 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47868:48591 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47869:48590 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47867:48593 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47869:48590 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47867:48593 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47870:48592 [3] NCCL INFO Ring 00 : 2 -> 3 -> 0
gpu009:47870:48592 [3] NCCL INFO Ring 01 : 2 -> 3 -> 0
gpu009:47868:48591 [1] NCCL INFO Tree 0 : 0 -> 1 -> 3/-1/-1
gpu009:47870:48592 [3] NCCL INFO Trees [0] 2/-1/-1->3->1 [1] 2/-1/-1->3->1
gpu009:47867:48593 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
gpu009:47868:48591 [1] NCCL INFO Tree 1 : 0 -> 1 -> 3/-1/-1
gpu009:47869:48590 [2] NCCL INFO Ring 00 : 1 -> 2 -> 3
gpu009:47870:48592 [3] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
gpu009:47868:48591 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2
gpu009:47869:48590 [2] NCCL INFO Ring 01 : 1 -> 2 -> 3
gpu009:47868:48591 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2
gpu009:47869:48590 [2] NCCL INFO Trees [0] -1/-1/-1->2->3 [1] -1/-1/-1->2->3
gpu009:47867:48593 [0] NCCL INFO Channel 00/02 : 0 1 2 3
gpu009:47870:48592 [3] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47868:48591 [1] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 3/-1/-1->1->0
gpu009:47869:48590 [2] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO Channel 01/02 : 0 1 2 3
gpu009:47868:48591 [1] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
gpu009:47867:48593 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
gpu009:47867:48593 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpu009:47867:48593 [0] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47868:48591 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47869:48590 [2] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00000
gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a00600
gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a00800
gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00a00
gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a01000
gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a01200
gpu009:47870:48592 [3] NCCL INFO Allocated 9637892 bytes of shared memory in /dev/shm/nccl-AP8lNO
gpu009:47867:48593 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7f2adda00000
gpu009:47868:48591 [1] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fc391a00000