`python3 test_flash_mm.py` fails with an error
tiendung opened this issue
```
ERROR: CUDA RT call "cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size )" in line 695 of file mm/csrc/flashmm/mm_block_fwd_cuda.cu failed with invalid device function (98).
max diff for mm block: tensor(2.0590e-05, device='cuda:0', grad_fn=<SelectBackward0>)
average diff for mm block: tensor(2.9658e-06, device='cuda:0', grad_fn=<MeanBackward0>)
max diff: tensor(0.0003, device='cuda:0')
avg diff: tensor(7.4159e-05, device='cuda:0')
```
I can still run the trainer, and the loss goes down.
This is usually a result of a mismatch in CUDA versions: https://forums.developer.nvidia.com/t/cudalaunchkernel-returned-status-98-invalid-device-function/169958
Can you try it with the NVIDIA PyTorch docker container? https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
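To make the "mismatch" concrete: the CUDA version PyTorch was built against (`torch.version.cuda`) and the toolkit that compiled the extension (`nvcc --version`) should agree on major.minor. A small illustrative helper for that check — the function name and the major.minor rule are my assumptions, not part of the flash_mm codebase:

```python
def cuda_versions_match(torch_cuda: str, toolkit_cuda: str) -> bool:
    """Return True when the major.minor components agree, e.g. '11.6' vs '11.6.124'.

    Hypothetical helper for illustration; compare torch.version.cuda against
    the `nvcc --version` release on your machine.
    """
    return torch_cuda.split(".")[:2] == toolkit_cuda.split(".")[:2]

print(cuda_versions_match("11.6", "11.6.124"))  # True: same major.minor
print(cuda_versions_match("11.6", "10.2"))      # False: a likely source of error 98
```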
Is it functioning correctly despite the CUDA mismatch? I'm running mm-bert and the loss is going down as usual.
The training loop is falling back to regular PyTorch, which is why the loss is still going down.
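A minimal sketch of that fallback pattern, with hypothetical names (`flashmm_cuda`, `mm_block_fwd`, `mm_block_reference` are assumptions; the real flash_mm package may organize this differently):

```python
# Sketch of a CUDA-extension fallback, assuming hypothetical module and
# function names. If the compiled extension cannot be imported (e.g. it was
# built against a mismatched CUDA version), the code silently dispatches to
# a pure-PyTorch reference path, so training still works — just slower.
try:
    import flashmm_cuda  # compiled CUDA extension; import fails if the build is broken
    HAS_CUDA_KERNEL = True
except ImportError:
    HAS_CUDA_KERNEL = False

def mm_block_reference(x):
    # Placeholder for the slow reference path; here it just returns x unchanged.
    return x

def mm_block(x):
    """Dispatch to the fast kernel when available, else the reference path."""
    if HAS_CUDA_KERNEL:
        return flashmm_cuda.mm_block_fwd(x)
    return mm_block_reference(x)
```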
I see. Thanks @DanFu09
May I ask one more question, @DanFu09: how much faster is the flash_mm kernel compared to the PyTorch implementation?
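One way to get a rough answer yourself is a simple wall-clock comparison. A generic timing sketch, not the project's benchmark — `reference_fn` is a CPU placeholder standing in for the two paths you would compare:

```python
import time

def benchmark(fn, arg, iters=100):
    """Average wall-clock seconds per call.

    For real CUDA kernels you would also need torch.cuda.synchronize()
    before and after the timed region, since kernel launches are
    asynchronous; omitted here because this sketch is CPU-only.
    """
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    return (time.perf_counter() - start) / iters

# Placeholder workload standing in for the kernel / PyTorch paths.
def reference_fn(n):
    return sum(i * i for i in range(n))

t = benchmark(reference_fn, 1000)
print(f"{t * 1e6:.1f} us per call")
```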