microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation


INTERNAL ASSERT FAILED

Qicheng-WANG opened this issue · comments

Hi there,
When I ran a quick test, "python3 -m tutel.examples.helloworld --batch_size=16", it showed the following error:
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp":46, please report a bug to PyTorch. CHECK_EQ fails.
Could you help me fix it? Thanks!

It also showed more output (see the attached screenshot).
I am using an NVIDIA 3090 and CUDA 11.3.

  1. Does print(torch.cuda.get_arch_list()) include sm_86?
  2. Can you try export USE_NVRTC=1 before running the example?
  3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?
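The three checks above can be scripted. Below is a minimal diagnostic sketch (my own, not part of Tutel): it reports whether the installed PyTorch build was compiled for a target architecture, whether `USE_NVRTC` is set, and which `nvcc` is on `PATH` and what toolkit release it reports.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract the toolkit version from `nvcc --version` output, e.g. (10, 2)."""
    m = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def run_checks(target_arch="sm_86"):
    # 1. Does the installed PyTorch build include the target architecture?
    try:
        import torch
        archs = torch.cuda.get_arch_list()
        print(f"{target_arch} compiled in: {target_arch in archs} ({archs})")
    except ImportError:
        print("torch is not installed")

    # 2. Is the NVRTC fallback requested via the environment?
    print("USE_NVRTC =", os.environ.get("USE_NVRTC", "<unset>"))

    # 3. Which nvcc is first on PATH (a stale install would show up here)?
    nvcc = shutil.which("nvcc")
    if nvcc:
        out = subprocess.run([nvcc, "--version"],
                             capture_output=True, text=True).stdout
        print(nvcc, "->", parse_nvcc_release(out))
    else:
        print("nvcc not found on PATH")


if __name__ == "__main__":
    run_checks()
```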
> 1. Does print(torch.cuda.get_arch_list()) include sm_86?
> 2. Can you try export USE_NVRTC=1 before running the example?
> 3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?

Hi! I am running Tutel on a Jetson Nano B01 (4GB version).
I also hit the problem "RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp"".

On the Nano:

1. print(torch.cuda.get_arch_list()) is ['sm_53', 'sm_62', 'sm_72']
2. I used export USE_NVRTC=1, but another error occurred.
3. My nvcc version is 10.2.3.

This is a problem with PyTorch + CUDA, not Tutel. You need a PyTorch build based on at least cu117/cu118 so that torch.cuda.get_arch_list() includes sm_86.
You also need to update your CUDA SDK (e.g. to 12.0), since NVIDIA's newer GPUs are not compatible with its older NVCC SDK.

CUDA 10.2.3 is too old: it cannot target any GPU newer than V100 (sm_7x). CUDA 11 supports A100-generation GPUs and CUDA 12 supports H100-generation GPUs. After upgrading the CUDA SDK, please also reinstall a PyTorch build based on at least cu118.
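As a rough guide to the toolkit-versus-GPU pairing described above, here is a small lookup sketch. The mapping is my own summary of NVIDIA's toolkit support (an assumption for illustration, not something defined by Tutel), giving the approximate minimum CUDA toolkit needed to compile for a given compute capability.

```python
def min_cuda_for_arch(arch):
    """Approximate minimum CUDA toolkit (major, minor) that can target
    a given sm_XX architecture. Mapping assumed from NVIDIA release
    notes; treat it as a sketch, not an authoritative table."""
    table = {
        70: (9, 0),    # V100 (Volta)
        75: (10, 0),   # Turing
        80: (11, 0),   # A100 (Ampere datacenter)
        86: (11, 1),   # RTX 30xx (Ampere consumer, e.g. the 3090 above)
        90: (12, 0),   # H100 (Hopper)
    }
    cc = int(arch.replace("sm_", ""))
    return table.get(cc)
```

For example, min_cuda_for_arch("sm_86") returns (11, 1), consistent with the advice that a CUDA 10.x toolchain cannot compile for an RTX 3090.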