microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation


INTERNAL ASSERT FAILED

Qicheng-WANG opened this issue · comments

Hi there,
When I ran a quick test, "python3 -m tutel.examples.helloworld --batch_size=16", it showed the following error:
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp":46, please report a bug to PyTorch. CHECK_EQ fails.
Could you help me fix it? Thanks!

It also showed more output (see the attached screenshot).
I am using an NVIDIA 3090 and CUDA 11.3.

  1. Does print(torch.cuda.get_arch_list()) include sm_86?
  2. Can you try export USE_NVRTC=1 before running the example?
  3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?
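The three checks above can be scripted. Below is a minimal diagnostic sketch (my own, not part of Tutel): it reports whether the installed PyTorch build was compiled for a target architecture, whether `USE_NVRTC` is set, and which `nvcc` is on `PATH` and what toolkit release it reports.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract the toolkit version from `nvcc --version` output, e.g. (10, 2)."""
    m = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def run_checks(target_arch="sm_86"):
    # 1. Does the installed PyTorch build include the target architecture?
    try:
        import torch
        archs = torch.cuda.get_arch_list()
        print(f"{target_arch} compiled in: {target_arch in archs} ({archs})")
    except ImportError:
        print("torch is not installed")

    # 2. Is the NVRTC fallback requested via the environment?
    print("USE_NVRTC =", os.environ.get("USE_NVRTC", "<unset>"))

    # 3. Which nvcc is first on PATH (a stale install would show up here)?
    nvcc = shutil.which("nvcc")
    if nvcc:
        out = subprocess.run([nvcc, "--version"],
                             capture_output=True, text=True).stdout
        print(nvcc, "->", parse_nvcc_release(out))
    else:
        print("nvcc not found on PATH")


if __name__ == "__main__":
    run_checks()
```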
> 1. Does print(torch.cuda.get_arch_list()) include sm_86?
> 2. Can you try export USE_NVRTC=1 before running the example?
> 3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?

Hi! I am running Tutel on a Jetson Nano B01 (4GB version).
I also hit the problem "RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp"".

On the Nano:

1. print(torch.cuda.get_arch_list()) is ['sm_53', 'sm_62', 'sm_72']
2. I used export USE_NVRTC=1, but another error occurred.
3. My nvcc version is 10.2.3.

This is a problem with PyTorch + CUDA, not Tutel. You need a PyTorch build based on at least cu117/cu118 so that torch.cuda.get_arch_list() includes sm_86.
You also need to update your CUDA SDK (e.g. to 12.0), since NVIDIA's newer GPUs are not compatible with its older NVCC SDK.

CUDA 10.2.3 is too old: it cannot target any GPU newer than V100 (sm_7x). CUDA 11 supports A100-generation GPUs and CUDA 12 supports H100-generation GPUs. After upgrading the CUDA SDK, please also reinstall a PyTorch build based on at least cu118.
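As a rough guide to the toolkit-versus-GPU pairing described above, here is a small lookup sketch. The mapping is my own summary of NVIDIA's toolkit support (an assumption for illustration, not something defined by Tutel), giving the approximate minimum CUDA toolkit needed to compile for a given compute capability.

```python
def min_cuda_for_arch(arch):
    """Approximate minimum CUDA toolkit (major, minor) that can target
    a given sm_XX architecture. Mapping assumed from NVIDIA release
    notes; treat it as a sketch, not an authoritative table."""
    table = {
        70: (9, 0),    # V100 (Volta)
        75: (10, 0),   # Turing
        80: (11, 0),   # A100 (Ampere datacenter)
        86: (11, 1),   # RTX 30xx (Ampere consumer, e.g. the 3090 above)
        90: (12, 0),   # H100 (Hopper)
    }
    cc = int(arch.replace("sm_", ""))
    return table.get(cc)
```

For example, min_cuda_for_arch("sm_86") returns (11, 1), consistent with the advice that a CUDA 10.x toolchain cannot compile for an RTX 3090.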