NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

ncclIpcSocketSendFd failed in register_user_buffer_collective(alloc=true), --tp-comm-overlap

jingjie01ai opened this issue

  1. register_user_buffer_collective fails at ncclIpcSocketSendFd(...) when alloc=true.
    [error msg]:
    UDS: Sending data over socket /tmp/nccl-socket-3-deadcafebeef failed : Connection refused (111)
    [code]:
    https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp#L427

  2. The ncclIpcSocketSendFd(...) call in create_communicator_grouped2 runs successfully.
    [code]: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp#L279

Questions:
Can I just allocate the GPU buffer outside of register_user_buffer_collective (i.e. call it with alloc=false)? I tried it and it succeeded.
What's the difference between allocating the buffer inside versus outside?