HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"

`python3 test_flash_mm.py` got error

tiendung opened this issue

ERROR: CUDA RT call "cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size )" in line 695 of file mm/csrc/flashmm/mm_block_fwd_cuda.cu failed with invalid device function (98).
max diff for mm block: tensor(2.0590e-05, device='cuda:0', grad_fn=<SelectBackward0>)
average diff for mm block: tensor(2.9658e-06, device='cuda:0', grad_fn=<MeanBackward0>)
max diff: tensor(0.0003, device='cuda:0')
avg diff: tensor(7.4159e-05, device='cuda:0')

I can still run the trainer, and the loss goes down.

commented

This is usually the result of a mismatch in CUDA versions: https://forums.developer.nvidia.com/t/cudalaunchkernel-returned-status-98-invalid-device-function/169958

Can you try it with the NVIDIA PyTorch docker container? https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
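A quick way to check for this kind of mismatch is to compare the CUDA toolkit version PyTorch was built against with the GPU's compute capability; error 98 ("invalid device function") typically means the extension was compiled for a different architecture than the one it is running on. A minimal diagnostic sketch, assuming only that PyTorch is installed:

```python
import torch

# CUDA toolkit version that this PyTorch build was compiled against
# (compare it with the toolkit used to build the flashmm extension).
print("PyTorch:", torch.__version__)
print("CUDA toolkit (PyTorch build):", torch.version.cuda)

if torch.cuda.is_available():
    # Compute capability of the current GPU, e.g. sm_80 for A100.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU compute capability: sm_{major}{minor}")
```

If the extension was not compiled with the GPU's `sm_XY` architecture in its `nvcc` flags, rebuilding it inside the matching NGC container usually resolves the error.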

Does it still function correctly despite the CUDA mismatch? I'm running mm-bert and the loss is going down as usual.

commented

The training loop is falling back to regular PyTorch, so that’s why the loss is going down.
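This kind of fallback commonly looks like the sketch below: the code tries to import the compiled extension and, if that fails, silently uses an equivalent pure-PyTorch path, so training still works (just slower). The module and function names (`flashmm`, `mm_block_fwd`) are illustrative assumptions, not the actual m2 layout:

```python
import torch

# Hypothetical fallback pattern. If the compiled CUDA extension fails
# to import (or its kernels fail to launch), fall back to plain PyTorch.
try:
    import flashmm  # compiled CUDA extension (illustrative name)
    USE_FLASH = True
except ImportError:
    USE_FLASH = False

def mm_block(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    if USE_FLASH:
        return flashmm.mm_block_fwd(x, w)  # fused CUDA kernel
    # Reference path: numerically equivalent plain PyTorch matmul.
    return x @ w

x = torch.randn(4, 8)
w = torch.randn(8, 8)
print(mm_block(x, w).shape)  # torch.Size([4, 8])
```

With such a fallback in place, the loss curve looks normal either way; only the throughput differs.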

I see. Thanks @DanFu09.

May I ask one more question, @DanFu09: how much faster is the flash_mm kernel compared to the PyTorch implementation?
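One way to answer this empirically is to time each path with `torch.utils.benchmark`. The sketch below times only a reference matmul, since the compiled kernel may not be installed; for a real comparison you would add a second `Timer` whose `stmt` calls the compiled kernel and compare the two medians:

```python
import torch
import torch.utils.benchmark as benchmark

# Reference implementation to benchmark (stand-in for the PyTorch path;
# a second Timer calling the fused kernel would give the comparison).
def reference(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return x @ w

x = torch.randn(256, 256)
w = torch.randn(256, 256)

t = benchmark.Timer(
    stmt="reference(x, w)",
    globals={"reference": reference, "x": x, "w": w},
)
# Measurement object; its median is the per-call time in seconds.
print(t.timeit(100))
```

`torch.utils.benchmark.Timer` handles warmup and synchronization concerns better than hand-rolled `time.time()` loops, which matters when timing asynchronous CUDA kernels.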