OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

[BUG] Model hangs at trainer.train() and never starts training

limllzu opened this issue

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

Current Behavior

The dataset loads without any problem, but the model stays stuck at trainer.train() in finetune.py.

Expected Behavior

No response

Steps To Reproduce

Data:
[
    {
        "id": "0",
        "image": "path/image/001.jpg",
        "conversations": [
            {
                "role": "user",
                "content": "\nHow many desserts are on the white plate?"
            },
            {
                "role": "assistant",
                "content": "There are three desserts on the white plate."
            },
            {
                "role": "user",
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant",
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            },
            {
                "role": "user",
                "content": "What is the setting of the image?"
            },
            {
                "role": "assistant",
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    }
]

Environment

Package environment:
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   2.1.0                    pypi_0    pypi
accelerate                0.30.1                   pypi_0    pypi
addict                    2.4.0                    pypi_0    pypi
aiofiles                  23.2.1                   pypi_0    pypi
altair                    5.3.0                    pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
anyio                     4.4.0                    pypi_0    pypi
attrs                     23.2.0                   pypi_0    pypi
binutils_impl_linux-64    2.36.1               h193b22a_2    conda-forge
binutils_linux-64         2.36                hf3e587d_10    conda-forge
bitsandbytes-cuda114      0.26.0.post2             pypi_0    pypi
blessed                   1.20.0                   pypi_0    pypi
blinker                   1.8.2                    pypi_0    pypi
blis                      0.7.11                   pypi_0    pypi
bzip2                     1.0.8                h5eee18b_6    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ca-certificates           2024.6.2             hbcca054_0    conda-forge
cachetools                5.3.3                    pypi_0    pypi
catalogue                 2.0.10                   pypi_0    pypi
certifi                   2024.2.2                 pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
cloudpathlib              0.16.0                   pypi_0    pypi
cmake                     3.25.0                   pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
confection                0.1.5                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
cymem                     2.0.8                    pypi_0    pypi
deepspeed                 0.14.3                   pypi_0    pypi
editdistance              0.6.2                    pypi_0    pypi
einops                    0.7.0                    pypi_0    pypi
et-xmlfile                1.1.0                    pypi_0    pypi
exceptiongroup            1.2.1                    pypi_0    pypi
fairscale                 0.4.0                    pypi_0    pypi
fastapi                   0.110.3                  pypi_0    pypi
ffmpy                     0.3.2                    pypi_0    pypi
filelock                  3.14.0                   pypi_0    pypi
flask                     3.0.3                    pypi_0    pypi
fonttools                 4.53.0                   pypi_0    pypi
fsspec                    2024.5.0                 pypi_0    pypi
gcc_impl_linux-64         11.2.0              h82a94d6_16    conda-forge
gcc_linux-64              11.2.0              h39a9532_10    conda-forge
gpustat                   1.1.1                    pypi_0    pypi
gradio                    4.26.0                   pypi_0    pypi
gradio-client             0.15.1                   pypi_0    pypi
grpcio                    1.64.1                   pypi_0    pypi
gxx_impl_linux-64         11.2.0              h82a94d6_16    conda-forge
gxx_linux-64              11.2.0              hacbe6df_10    conda-forge
h11                       0.14.0                   pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
httpcore                  1.0.5                    pypi_0    pypi
httpx                     0.27.0                   pypi_0    pypi
huggingface-hub           0.23.2                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
importlib-resources       6.4.0                    pypi_0    pypi
install                   1.3.5                    pypi_0    pypi
itsdangerous              2.2.0                    pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
jsonlines                 4.0.0                    pypi_0    pypi
jsonschema                4.22.0                   pypi_0    pypi
jsonschema-specifications 2023.12.1                pypi_0    pypi
kernel-headers_linux-64   2.6.32              he073ed8_17    conda-forge
kiwisolver                1.4.5                    pypi_0    pypi
langcodes                 3.4.0                    pypi_0    pypi
language-data             1.2.0                    pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libaio                    0.9.3                    pypi_0    pypi
libffi                    3.4.4                h6a678d5_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-devel_linux-64     11.2.0              h0952999_16    conda-forge
libgcc-ng                 13.2.0               h77fa898_7    conda-forge
libgomp                   13.2.0               h77fa898_7    conda-forge
libsanitizer              11.2.0              he4da1e4_16    conda-forge
libstdcxx-devel_linux-64  11.2.0              h0952999_16    conda-forge
libstdcxx-ng              13.2.0               hc0a3c3a_7    conda-forge
libuuid                   1.41.5               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit                       15.0.7                   pypi_0    pypi
lxml                      5.2.2                    pypi_0    pypi
marisa-trie               1.1.1                    pypi_0    pypi
markdown                  3.6                      pypi_0    pypi
markdown-it-py            3.0.0                    pypi_0    pypi
markdown2                 2.4.10                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.7.4                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
more-itertools            10.1.0                   pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
murmurhash                1.0.10                   pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx                  3.3                      pypi_0    pypi
ninja                     1.10.0                   pypi_0    pypi
ninja-base                1.10.2               hd09550d_5    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nltk                      3.8.1                    pypi_0    pypi
numpy                     1.24.4                   pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-ml-py              12.535.161               pypi_0    pypi
nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.5.40                  pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
nvitop                    1.3.2                    pypi_0    pypi
opencv-python-headless    4.5.5.64                 pypi_0    pypi
openpyxl                  3.1.2                    pypi_0    pypi
openssl                   3.3.1                h4ab18f5_0    conda-forge
orjson                    3.10.3                   pypi_0    pypi
packaging                 23.2                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
peft                      0.11.1                   pypi_0    pypi
pillow                    10.1.0                   pypi_0    pypi
pip                       24.0            py310h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
portalocker               2.8.2                    pypi_0    pypi
preshed                   3.0.9                    pypi_0    pypi
protobuf                  4.25.0                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pydantic                  2.7.2                    pypi_0    pypi
pydantic-core             2.18.3                   pypi_0    pypi
pydub                     0.25.1                   pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
pynvml                    11.5.0                   pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
pyproject                 1.3.1                    pypi_0    pypi
python                    3.10.14              h955ad1f_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil           2.9.0.post0              pypi_0    pypi
python-multipart          0.0.9                    pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
referencing               0.35.1                   pypi_0    pypi
regex                     2024.5.15                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
rich                      13.7.1                   pypi_0    pypi
rpds-py                   0.18.1                   pypi_0    pypi
ruff                      0.4.7                    pypi_0    pypi
sacrebleu                 2.3.2                    pypi_0    pypi
safetensors               0.4.3                    pypi_0    pypi
seaborn                   0.13.0                   pypi_0    pypi
semantic-version          2.10.0                   pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                69.5.1          py310h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shellingham               1.5.4                    pypi_0    pypi
shortuuid                 1.0.11                   pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
smart-open                6.4.0                    pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
socksio                   1.0.0                    pypi_0    pypi
spacy                     3.7.2                    pypi_0    pypi
spacy-legacy              3.0.12                   pypi_0    pypi
spacy-loggers             1.0.5                    pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
srsly                     2.4.8                    pypi_0    pypi
starlette                 0.37.2                   pypi_0    pypi
sympy                     1.12.1                   pypi_0    pypi
sysroot_linux-64          2.12                he073ed8_17    conda-forge
tabulate                  0.9.0                    pypi_0    pypi
tensorboard               2.16.2                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorboardx              1.8                      pypi_0    pypi
termcolor                 2.4.0                    pypi_0    pypi
thinc                     8.2.3                    pypi_0    pypi
timm                      0.9.10                   pypi_0    pypi
tk                        8.6.14               h39e8969_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers                0.19.1                   pypi_0    pypi
tomlkit                   0.12.0                   pypi_0    pypi
toolz                     0.12.1                   pypi_0    pypi
torch                     2.1.2+cu118              pypi_0    pypi
torchaudio                2.1.2+cu118              pypi_0    pypi
torchvision               0.16.2+cu118             pypi_0    pypi
tqdm                      4.66.1                   pypi_0    pypi
transformers              4.41.2                   pypi_0    pypi
triton                    2.1.0                    pypi_0    pypi
typer                     0.9.4                    pypi_0    pypi
typing-extensions         4.8.0                    pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.2.1                    pypi_0    pypi
uvicorn                   0.24.0.post1             pypi_0    pypi
wasabi                    1.1.3                    pypi_0    pypi
wcwidth                   0.2.13                   pypi_0    pypi
weasel                    0.3.4                    pypi_0    pypi
websockets                11.0.3                   pypi_0    pypi
werkzeug                  3.0.3                    pypi_0    pypi
wheel                     0.43.0          py310h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz                        5.4.6                h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
zlib                      1.2.13               h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

Anything else?

Output:
prepare trainer
Training dataset length: 1
Validation dataset length: 1
<class 'trainer.CPMTrainer'>
trainer ok

Error messages:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...

Relevant code:

# Check the dataset lengths
print(f"Training dataset length: {len(data_module['train_dataset'])}")
print(f"Validation dataset length: {len(data_module['eval_dataset'])}")


rank0_print("prepare trainer")

trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)

rank0_print(type(trainer))

rank0_print("trainer ok")

trainer.train()

trainer.save_state()

rank0_print("trainer sucess")
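
When a script stalls inside trainer.train() without raising an error, one generic way to see where each process is blocked is Python's standard faulthandler module. The sketch below is not part of finetune.py: arm_hang_watchdog, disarm_hang_watchdog and the 300-second timeout are hypothetical names and values chosen for illustration. The idea would be to call arm_hang_watchdog() just before trainer.train(), so that a process that is still running after the timeout dumps the stacks of all its threads.

```python
import faulthandler
import sys
from typing import IO, Optional


def arm_hang_watchdog(timeout_s: float = 300.0, file: Optional[IO] = None) -> None:
    """Dump the stack of every thread if the process is still running after timeout_s.

    The dump repeats every timeout_s seconds and does not terminate the process.
    `file` must be a real file with a file descriptor; it defaults to sys.stderr.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump, e.g. once training has finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

With a multi-process launcher, each rank would write its own dump, so comparing them should show which call the ranks are waiting in.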

How large is your dataset?

The dataset contains only a single sample, the one provided in the official demo.
It looks like this:

[
    {
        "id": "0",
        "image": "path/image/image_0.jpg",
        "conversations": [
            {
              "role": "user", 
              "content": "<image>\nHow many desserts are on the white plate?"
            }, 
            {
                "role": "assistant", 
                "content": "There are three desserts on the white plate."
            },   
            {
                "role": "user", 
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant", 
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            }, 
            {
                "role": "user", 
                "content": "What is the setting of the image?"
            }, 
            {
                "role": "assistant", 
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    }
]
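
Because the question here is whether the single sample could be at fault, a quick structural check of the JSON can rule the data out before looking at the distributed setup. The following is a minimal sketch that assumes only the layout visible in the sample above: a list of records with "id", "image", and a "conversations" list whose turns alternate between "user" and "assistant". validate_records is a hypothetical helper, not part of the MiniCPM-V repository, and the alternation rule is an assumption drawn from this one sample.

```python
import json
from typing import Any, List


def validate_records(records: Any) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks as expected."""
    problems: List[str] = []
    if not isinstance(records, list):
        return ["top level must be a JSON list of records"]
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        for key in ("id", "image", "conversations"):
            if key not in rec:
                problems.append(f"record {i}: missing key '{key}'")
        turns = rec.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {i}: 'conversations' must be a non-empty list")
            continue
        for j, turn in enumerate(turns):
            # Assumption: turns start with the user and alternate strictly.
            expected = "user" if j % 2 == 0 else "assistant"
            if not isinstance(turn, dict) or turn.get("role") != expected:
                problems.append(f"record {i}, turn {j}: expected role '{expected}'")
            elif not isinstance(turn.get("content"), str):
                problems.append(f"record {i}, turn {j}: 'content' must be a string")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"id": "0", "image": "path/image/image_0.jpg", "conversations": ['
        '{"role": "user", "content": "<image>\\nHow many desserts are on the white plate?"}, '
        '{"role": "assistant", "content": "There are three desserts on the white plate."}]}]'
    )
    print(validate_records(sample) or "layout looks as expected")
```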

Here is my environment for reference:
requirements.txt
My Linux kernel version is 5.4.0; I am not sure whether the hang is caused by your kernel version being 3.10.0.
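
The kernel question raised here relates to the warning in the reporter's log ("Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0"). A minimal sketch for checking a node before launching, using only the standard library, might look like the following. The 5.5.0 threshold is taken from that warning; parse_kernel_version and kernel_below are hypothetical helper names.

```python
import platform
import re
from typing import Tuple


def parse_kernel_version(release: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '3.10.0-1160.el7.x86_64' -> (3, 10, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release string: {release!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def kernel_below(release: str, minimum: Tuple[int, ...] = (5, 5, 0)) -> bool:
    """True if the kernel release is older than the given minimum."""
    return parse_kernel_version(release) < minimum


if __name__ == "__main__":
    release = platform.release()
    print(release, "is below 5.5.0" if kernel_below(release) else "meets 5.5.0")
```

Note that 5.4.0 is also below that threshold, so the warning on its own does not settle whether the kernel is the cause.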

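Thank you for your reply. It seems the cause was that NCCL was not installed, but after installing it a new problem appeared. Could you take a look? Thanks.
Error messages: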
Traceback (most recent call last):
  File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 708, in <module>
    train()
  File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 690, in train
    trainer.train()
  File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
    self._broadcast_model()
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1069, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/envs/llm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/envs/llm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1251, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Bootstrap : no socket interface found
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30898 closing signal SIGTERM
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30899 closing signal SIGTERM
[2024-06-14 16:05:19,307] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30900 closing signal SIGTERM
[2024-06-14 16:05:20,135] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 30897) of binary: /envs/llm/bin/python

Output:
prepare trainer
trainer ok
[2024-06-14 16:05:12,128] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens1f1
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ens1f1
gpu009:30897:30897 [0] bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
gpu009:30897:30897 [0] NCCL INFO init.cc:82 -> 3
gpu009:30897:30897 [0] NCCL INFO init.cc:101 -> 3

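Judging from the log, it looks like the network interface is not set correctly, but the ens1f1 interface does show up when I run ifconfig, and I can ping through it.
Could you please take a look? Thank you!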
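
Since the log reports "no socket interface found" for an interface that ifconfig lists, it may help to check what a Python process on the same node, inside the same job, can actually see. The sketch below is a Linux-only illustration using the standard library. It does not reproduce NCCL's own interface selection, and it treats NCCL_SOCKET_IFNAME as one literal interface name, which is a simplifying assumption. ipv4_of and describe_interface are hypothetical helper names.

```python
import fcntl
import os
import socket
import struct
from typing import Optional

SIOCGIFADDR = 0x8915  # Linux ioctl request: get the IPv4 address of an interface


def ipv4_of(ifname: str) -> Optional[str]:
    """Return the IPv4 address bound to ifname, or None if it has none or does not exist."""
    packed = struct.pack("256s", ifname.encode()[:15])
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            result = fcntl.ioctl(sock.fileno(), SIOCGIFADDR, packed)
        except OSError:
            return None
    return socket.inet_ntoa(result[20:24])


def describe_interface(ifname: str) -> str:
    """Summarise whether ifname is visible to this process and whether it has an IPv4 address."""
    names = [name for _, name in socket.if_nameindex()]
    if ifname not in names:
        return f"{ifname}: not present on {socket.gethostname()} (visible: {', '.join(names)})"
    addr = ipv4_of(ifname)
    if addr is None:
        return f"{ifname}: present but has no IPv4 address"
    return f"{ifname}: present, IPv4 {addr}"


if __name__ == "__main__":
    # Falls back to the loopback interface if the variable is unset.
    print(describe_interface(os.environ.get("NCCL_SOCKET_IFNAME", "lo")))
```

Running it on the compute node named in the log (gpu009), rather than on a login node, keeps the comparison with the NCCL output meaningful.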

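When I switch the network interface to ib0, the error no longer occurs, but according to the NCCL log the process still hangs and training does not start.
Output: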
prepare trainer
Training dataset length: 1
Validation dataset length: 1
trainer ok
[2024-06-14 16:59:42,697] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown
gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47867:47867 [0] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47867:47867 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47867:47867 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47868:47868 [1] NCCL INFO cudaDriverVersion 12000
gpu009:47869:47869 [2] NCCL INFO cudaDriverVersion 12000
gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47870:47870 [3] NCCL INFO cudaDriverVersion 12000
gpu009:47868:47868 [1] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47869:47869 [2] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47869:47869 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47868:47868 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47869:47869 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47868:47868 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set to ib0
gpu009:47869:47869 [2] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc46e800000
gpu009:47868:47868 [1] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc390800000
gpu009:47870:47870 [3] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0>
gpu009:47870:47870 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
gpu009:47870:47870 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
gpu009:47870:47870 [3] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fbfc0800000
gpu009:47869:48590 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47869:48590 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47869:48590 [2] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47869:48590 [2] NCCL INFO Using network Socket
gpu009:47868:48591 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47868:48591 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47867:47867 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.6+cuda12.1
gpu009:47868:48591 [1] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47868:48591 [1] NCCL INFO Using network Socket
gpu009:47870:48592 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47867:47867 [0] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f2adc800000
gpu009:47870:48592 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47870:48592 [3] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47870:48592 [3] NCCL INFO Using network Socket
gpu009:47867:48593 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47867:48593 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0
gpu009:47867:48593 [0] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0>
gpu009:47867:48593 [0] NCCL INFO Using network Socket
gpu009:47867:48593 [0] NCCL INFO comm 0x7ef5c880 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 50000 commId 0x361b5540a6088610 - Init START
gpu009:47870:48592 [3] NCCL INFO comm 0x68da1c00 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 9c000 commId 0x361b5540a6088610 - Init START
gpu009:47869:48590 [2] NCCL INFO comm 0x69374b40 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 57000 commId 0x361b5540a6088610 - Init START
gpu009:47868:48591 [1] NCCL INFO comm 0x68b107c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 53000 commId 0x361b5540a6088610 - Init START
gpu009:47870:48592 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47868:48591 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47869:48590 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47867:48593 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0'
gpu009:47870:48592 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47870:48592 [3] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47870:48592 [3] NCCL INFO CPU/0 (1/1/2)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47870:48592 [3] NCCL INFO CPU/1 (1/1/2)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47870:48592 [3] NCCL INFO ==========================================
gpu009:47870:48592 [3] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47868:48591 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47870:48592 [3] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47870:48592 [3] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47870:48592 [3] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47870:48592 [3] NCCL INFO Setting affinity for GPU 3 to 3ff00000,0000003f,f0000000
gpu009:47870:48592 [3] NCCL INFO NVLS multicast support is not available on dev 3
gpu009:47869:48590 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47867:48593 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
gpu009:47868:48591 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47868:48591 [1] NCCL INFO CPU/0 (1/1/2)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47868:48591 [1] NCCL INFO CPU/1 (1/1/2)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47869:48590 [2] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47867:48593 [0] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
gpu009:47869:48590 [2] NCCL INFO CPU/0 (1/1/2)
gpu009:47868:48591 [1] NCCL INFO ==========================================
gpu009:47867:48593 [0] NCCL INFO CPU/0 (1/1/2)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47868:48591 [1] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47868:48591 [1] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/50000 (0)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47868:48591 [1] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/53000 (1)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47868:48591 [1] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - NIC/56000
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47870:48592 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47868:48591 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,00000000,0003ff00
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/57000 (2)
gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47868:48591 [1] NCCL INFO NVLS multicast support is not available on dev 1
gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/1
gpu009:47869:48590 [2] NCCL INFO CPU/1 (1/1/2)
gpu009:47867:48593 [0] NCCL INFO CPU/1 (1/1/2)
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000)
gpu009:47870:48592 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/9C000 (3)
gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/0
gpu009:47869:48590 [2] NCCL INFO ==========================================
gpu009:47867:48593 [0] NCCL INFO ==========================================
gpu009:47869:48590 [2] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47869:48590 [2] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47869:48590 [2] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47867:48593 [0] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS)
gpu009:47869:48590 [2] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47867:48593 [0] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB)
gpu009:47869:48590 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,00000000,0003ff00
gpu009:47867:48593 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,00000000,0003ff00
gpu009:47869:48590 [2] NCCL INFO NVLS multicast support is not available on dev 2
gpu009:47867:48593 [0] NCCL INFO NVLS multicast support is not available on dev 0
gpu009:47868:48591 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47868:48591 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47869:48590 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47867:48593 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1
gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
gpu009:47869:48590 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47867:48593 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1
gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2
gpu009:47870:48592 [3] NCCL INFO Ring 00 : 2 -> 3 -> 0
gpu009:47870:48592 [3] NCCL INFO Ring 01 : 2 -> 3 -> 0
gpu009:47868:48591 [1] NCCL INFO Tree 0 : 0 -> 1 -> 3/-1/-1
gpu009:47870:48592 [3] NCCL INFO Trees [0] 2/-1/-1->3->1 [1] 2/-1/-1->3->1
gpu009:47867:48593 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
gpu009:47868:48591 [1] NCCL INFO Tree 1 : 0 -> 1 -> 3/-1/-1
gpu009:47869:48590 [2] NCCL INFO Ring 00 : 1 -> 2 -> 3
gpu009:47870:48592 [3] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
gpu009:47868:48591 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2
gpu009:47869:48590 [2] NCCL INFO Ring 01 : 1 -> 2 -> 3
gpu009:47868:48591 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2
gpu009:47869:48590 [2] NCCL INFO Trees [0] -1/-1/-1->2->3 [1] -1/-1/-1->2->3
gpu009:47867:48593 [0] NCCL INFO Channel 00/02 : 0 1 2 3
gpu009:47870:48592 [3] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47868:48591 [1] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 3/-1/-1->1->0
gpu009:47869:48590 [2] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO Channel 01/02 : 0 1 2 3
gpu009:47868:48591 [1] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
gpu009:47867:48593 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
gpu009:47867:48593 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpu009:47867:48593 [0] NCCL INFO P2P Chunksize set to 131072
gpu009:47867:48593 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47868:48591 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47869:48590 [2] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00000
gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a00600
gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a00800
gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00a00
gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a01000
gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a01200
gpu009:47870:48592 [3] NCCL INFO Allocated 9637892 bytes of shared memory in /dev/shm/nccl-AP8lNO
gpu009:47867:48593 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7f2adda00000
gpu009:47868:48591 [1] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fc391a00000