Ascend / pytorch

Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch

Home Page: https://ascend.github.io/docs/


[BUG] Multi-NPU training error: Could not compute stream ID for XXX on device -1

SolenoidWGT opened this issue

Hi, I am a beginner with torch_npu. I am running an eight-NPU LLaMA model pre-training task on an Ascend 910B machine, and an error is reported. Can anyone help me?

Error stack:

Traceback (most recent call last):
  File "train.py", line 312, in <module>
    initialize_distributed_env(config=args.config, launcher=args.launcher, master_port=args.port, seed=args.seed, backend=args.backend)
  File "/data/wangguoteng/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/data/wangguoteng/InternEvo/internlm/initialize/launch.py", line 546, in initialize_distributed_env
    launch_from_torch(config=config, seed=seed, backend=backend)
  File "/data/wangguoteng/InternEvo/internlm/initialize/launch.py", line 517, in launch_from_torch
    seed=seed,
  File "/data/wangguoteng/InternEvo/internlm/initialize/launch.py", line 433, in launch
    warmup_process_group()
  File "/data/wangguoteng/InternEvo/internlm/utils/gputest.py", line 290, in warmup_process_group
    buffer = torch.randn([64], device=internlm_accelerator.current_device_name())
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/utils/device_guard.py", line 45, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/utils/torch_funcs.py", line 253, in _randn
    return torch_npu.randn(*args, **kwargs)
RuntimeError: 0INTERNAL ASSERT FAILED at "/usr1/02/workspace/j_3s1HgBoq/pytorch/torch_npu/csrc/core/npu/NPUStream.cpp":143, please report a bug to PyTorch. Could not compute stream ID for 0xffff83a26eb0 on device -1 (something has gone horribly wrong!)
terminate called after throwing an instance of 'c10::Error'
  what():  0INTERNAL ASSERT FAILED at "/usr1/02/workspace/j_3s1HgBoq/pytorch/torch_npu/csrc/core/npu/NPUStream.cpp":143, please report a bug to PyTorch. Could not compute stream ID for 0xffff83a26eb0 on device -1 (something has gone horribly wrong!)
Exception raised from NPUStream_getStreamId at /usr1/02/workspace/j_3s1HgBoq/pytorch/torch_npu/csrc/core/npu/NPUStream.cpp:143 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x8c (0xffff876f6114 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xa0 (0xffff876f2418 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4c (0xffff876f3d34 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x12b328c (0xffff826d328c in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: c10_npu::getCurrentNPUStream(signed char) + 0x74 (0xffff826d54ec in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x12d01b4 (0xffff826f01b4 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: c10_npu::NpuSysCtrl::Finalize() + 0xe8 (0xffff826f1c30 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: THPModule_npu_shutdown(_object*) + 0x1dc (0xffff83a84594 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/_C.cpython-37m-aarch64-linux-gnu.so)
<omitting python frames>
frame #17: __libc_start_main + 0xf0 (0xffff8da01724 in /lib64/libc.so.6)
frame #18: /usr/local/python3.7.5/bin/python3.7() [0x400834]

The code location where the error is raised:

import os
import torch
import torch.distributed as dist
# internlm_accelerator, gpc, and ParallelMode are InternEvo internals,
# assumed to be imported elsewhere in internlm/utils/gputest.py.

def warmup_process_group():
    # Prevent OOM from nccl communication.
    if dist.is_initialized():
        print(f"rank: {os.environ['RANK']}, {internlm_accelerator.current_device_name()}", flush=True)
        buffer = torch.randn([64], device=internlm_accelerator.current_device_name())
        if gpc.is_initialized(ParallelMode.DATA):
            ...  # snippet truncated in the original report
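
The "device -1" in the assert suggests the worker process never selected an NPU before this allocation, so torch_npu cannot resolve a stream for the current device. A minimal sketch of a fix, assuming a torchrun-style launcher that exports LOCAL_RANK (the helper name bind_local_npu is illustrative, not part of InternEvo):

import os
import torch_npu  # registers the "npu" device type with PyTorch

def bind_local_npu() -> int:
    # Each worker must bind to its own NPU before the first tensor allocation;
    # otherwise the current device can stay unset (-1) and stream lookup fails.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch_npu.npu.set_device(f"npu:{local_rank}")
    return local_rank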

Environment information:

Image version: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC3-1.11.0-centos7
System Info:

Collecting environment information...
PyTorch version: 1.11.0a0+gitbc2c6ed
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (AltArch) (aarch64)
GCC version: (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.5 (default, Oct 31 2023, 08:15:39)  [GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] (64-bit runtime)
Python platform: Linux-4.14.0-115.el7a.0.1.aarch64-aarch64-with-centos-7.9.2009-AltArch
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             4
NUMA node(s):          1
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              24576K
NUMA node0 CPU(s):     0-191
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0
[pip3] torch-npu==1.11.0.post4
[pip3] torchvision==0.12.0
[conda] Could not collect

You may need to call set_device for every device before you use it, i.e., once in each worker process.
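
Concretely, a hedged sketch of the per-process setup for an eight-NPU job launched with torchrun; the environment variable names follow the standard torch.distributed launcher, and HCCL is the backend torch_npu provides in place of NCCL:

import os
import torch
import torch.distributed as dist
import torch_npu  # must be imported so the "npu" device and "hccl" backend exist

def init_distributed_npu() -> int:
    local_rank = int(os.environ["LOCAL_RANK"])
    # Bind this process to its NPU *before* init_process_group and before
    # any tensor is created, so every stream query sees a valid device.
    torch_npu.npu.set_device(f"npu:{local_rank}")
    dist.init_process_group(backend="hccl")  # reads RANK/WORLD_SIZE from the env
    return local_rank

if __name__ == "__main__":
    rank = init_distributed_npu()
    # The warmup allocation now succeeds on the bound device.
    buf = torch.randn([64], device=f"npu:{rank}")
    dist.destroy_process_group()

Calling set_device once per process, keyed by LOCAL_RANK, is the NPU analogue of torch.cuda.set_device in multi-GPU training; without it the warmup tensor is created before any device is current, which matches the "device -1" in the stack above.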