pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

NotImplementedError: Could not run 'fbgemm::asynchronous_complete_cumsum' with arguments from the 'CUDA' backend

DeepakSaini119 opened this issue

I'm running torch nightly version 2.1.0.dev20230623+cu118 on Ubuntu 22.04.2, and I have made sure that the CUDA toolkit version (nvcc -V) is 11.8.

I installed fbgemm-gpu-nightly successfully with pip, which reported Successfully installed fbgemm-gpu-nightly-2023.6.23.
However, a torchrec script that uses jagged_tensor ops from fbgemm fails with NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. Any suggestions? TIA.
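(A quick way to confirm that the installed PyTorch wheel and the local toolkit agree on the CUDA version is a check along the lines below; this is only a diagnostic sketch, not part of the original report.)

import torch

# CUDA version the installed PyTorch wheel was built against (should read 11.8 here)
print(torch.__version__)
print(torch.version.cuda)
# Confirms that a CUDA device is actually reachable from this environment
print(torch.cuda.is_available())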

Hi @DeepakSaini119, thanks for reaching out to us. Do you have a minimal example of the torchrec script, or a minimal code example that uses fbgemm_gpu, for reproducing this error?

Please find a minimal code example below

import os
import argparse
import numpy as np
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import torchrec
from torchrec.modules.embedding_configs import DataType


def setup(rank, args):
    torch.cuda.set_device(rank)

    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    dist.init_process_group("nccl", rank=rank, world_size=args.world_size)

    return None


def train(rank, embedding_tables, args):
    print(f"setting up in {rank}")
    setup(rank, args)

    model = torchrec.distributed.DistributedModelParallel(embedding_tables, device=torch.device(rank))

    if(rank == 0):
        print(model)
        print(model.plan)
        print("named params")
        for name, param in model.named_parameters():
            print(name, param.shape, param)

    label_index = torch.LongTensor([0,1,2]).to(rank)

    minibatch_kjt = torchrec.KeyedJaggedTensor(
        keys=["free_parameters"],
        values=label_index,
        lengths=torch.ones(len(label_index), dtype=torch.int64).to(label_index.device),
    )
    free_params = model(minibatch_kjt)["free_parameters"].values() # fails at this line with the said error

    print("free_params:", free_params.shape, free_params)


if(__name__ == "__main__"):
    parser = argparse.ArgumentParser()
    args = parser.parse_args()

    args.nodes = 1
    args.gpus = torch.cuda.device_count()
    args.world_size = args.gpus * args.nodes

    
    embedding_tables = torchrec.EmbeddingCollection(
        device="cpu",
        tables=[
            torchrec.EmbeddingConfig(
                name="free_parameters",
                embedding_dim=64,
                num_embeddings=10,
                data_type=DataType.FP32,
            )
        ],
    )
    

    lbl_embs = np.array([np.ones((64, ), dtype=np.float32) * i for i in range(10)])
    print(lbl_embs)  # is [[0, 0, 0, ...], [1, 1, 1, ...], [2, 2, 2, ...], ..., [9, 9, 9, ...]]
    
    with torch.no_grad():
        for name, param in embedding_tables.named_parameters():
            if name == 'embeddings.free_parameters.weight':
                print("Intializing table...")
                param.copy_(torch.tensor(lbl_embs))

    print(">>> Spawning processes")
    mp.spawn(train, nprocs=args.gpus, args=(embedding_tables, args,))

I have also tried with CUDA version 11.6, but I hit the same issue.

@q10, adding the full stack trace in case it helps in any way

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/user/TripletClf/TestDDP.py", line 124, in train
    free_params = model(minibatch_kjt)["free_parameters"].values()
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/TripletClf/./torchrec/torchrec/distributed/model_parallel.py", line 265, in forward
    return self._dmp_wrapped_module(*args, **kwargs)
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/TripletClf/./torchrec/torchrec/distributed/types.py", line 698, in forward
    dist_input = self.input_dist(ctx, *input, **kwargs).wait().wait()
  File "/home/user/TripletClf/./torchrec/torchrec/distributed/embedding.py", line 681, in input_dist
    awaitables.append(input_dist(features))
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/TripletClf/./torchrec/torchrec/distributed/sharding/rw_sharding.py", line 257, in forward
    ) = bucketize_kjt_before_all2all(
  File "/home/user/TripletClf/./torchrec/torchrec/distributed/embedding_sharding.py", line 92, in bucketize_kjt_before_all2all
    ) = torch.ops.fbgemm.block_bucketize_sparse_features(
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/_ops.py", line 677, in __call__
    return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::block_bucketize_sparse_features' is only available for these backends: [BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:498 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:290 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:30 [backend fallback]
AutogradCPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:34 [backend fallback]
AutogradCUDA: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:42 [backend fallback]
AutogradXLA: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:46 [backend fallback]
AutogradMPS: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:54 [backend fallback]
AutogradXPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:38 [backend fallback]
AutogradHPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
AutogradLazy: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:50 [backend fallback]
AutogradMeta: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:58 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:296 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:383 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:250 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:710 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:201 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:494 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at /__w/FBGEMM/FBGEMM/fbgemm_gpu/src/sparse_ops/sparse_ops_cpu.cpp:2588 [kernel]
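(The backend list in the error can also be inspected directly from the dispatcher. The snippet below is a diagnostic sketch that relies on torch._C._dispatch_dump, an internal and undocumented PyTorch API that may change between releases; when only the CPU kernels from sparse_ops_cpu.cpp were registered, no CUDA entry shows up in its output.)

import torch
import fbgemm_gpu  # importing registers whatever fbgemm kernels the installed wheel shipped

# Internal/undocumented API: dumps the registration state for a single operator.
# A missing CUDA entry here is consistent with the NotImplementedError above.
print(torch._C._dispatch_dump("fbgemm::block_bucketize_sparse_features"))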

We've been able to come up with a bare-minimum example (i.e. no torchrec involved) that produces an error with a similar signature:

import torch
import fbgemm_gpu
indices_tensor = torch.tensor([0, 2, 1, 3], device='cuda:0', dtype=torch.int32)
lengths = torch.tensor([[3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]], device='cuda:0', dtype=torch.int32)
values = torch.tensor([0, 3, 0, 2, 3, 2, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,3, 1, 2, 1, 3, 0, 0, 2, 2, 1, 3, 1, 1, 0, 1, 1, 3, 1, 0, 1, 1, 1, 3, 3], device='cuda:0')
permuted_lengths_sum = 12
torch.ops.fbgemm.permute_2D_sparse_data(
    indices_tensor,
    lengths,
    values,
    None,
    permuted_lengths_sum,
)

Looking further into this.
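(One rough way to confirm that only the CPU kernels were registered — the last line of the dispatcher dump above points at sparse_ops_cpu.cpp — is to run the same op on CPU tensors. The snippet below is a sketch under that assumption, using a permuted_lengths_sum consistent with the lengths rather than the value from the repro.)

import torch
import fbgemm_gpu

# Same op, CPU tensors: if this succeeds while the CUDA call above fails,
# the wheel shipped the CPU kernels but not the CUDA ones.
permute = torch.tensor([0, 2, 1, 3], dtype=torch.int32)
lengths = torch.full((4, 4), 3, dtype=torch.int32)
values = torch.arange(int(lengths.sum()), dtype=torch.int64)
result = torch.ops.fbgemm.permute_2D_sparse_data(permute, lengths, values, None, int(lengths.sum()))
print("CPU kernel ran:", [None if t is None else tuple(t.shape) for t in result])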

Hi @DeepakSaini119, the error no longer occurs on the current nightly. If you update to the current nightly, it should work.

Hi @spcyppt, the error still persists for me with fbgemm-gpu-nightly-2023.7.26.

@DeepakSaini119 do you get any errors from running the code above? i.e.,

import torch
import fbgemm_gpu
indices_tensor = torch.tensor([0, 2, 1, 3], device='cuda:0', dtype=torch.int32)
lengths = torch.tensor([[3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]], device='cuda:0', dtype=torch.int32)
values = torch.tensor([0, 3, 0, 2, 3, 2, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,3, 1, 2, 1, 3, 0, 0, 2, 2, 1, 3, 1, 1, 0, 1, 1, 3, 1, 0, 1, 1, 1, 3, 3], device='cuda:0')
permuted_lengths_sum = 12
torch.ops.fbgemm.permute_2D_sparse_data(
    indices_tensor,
    lengths,
    values,
    None,
    permuted_lengths_sum,
)

Hi @spcyppt, the error is

Traceback (most recent call last):
  File "/home/user/TripletClf/TestFBGEMM.py", line 7, in <module>
    torch.ops.fbgemm.permute_2D_sparse_data(
  File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/_ops.py", line 681, in __call__
    return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::permute_2D_sparse_data' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::permute_2D_sparse_data' is only available for these backends: [BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

Hi @DeepakSaini119, thank you for your input. I will look into this further.

Hi @DeepakSaini119, the workaround for this would be building FBGEMM_GPU from source; please see https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/docs/BuildInstructions.md for instructions. We're looking into the issues with the fbgemm-gpu-nightly packages from pip.

Closing this issue since there is a workaround. @DeepakSaini119, please feel free to reopen it if you still run into the issue. Thanks.

@spcyppt, actually, can we keep this issue open, or open a separate issue for this problem? I'd like to track when it gets resolved.

Hi @DeepakSaini119, the issue is due to a CUDA version mismatch. The currently published fbgemm-gpu-nightly is now built against CUDA 11.8, so installing the latest fbgemm-gpu-nightly should fix the issue. Thank you.
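(After upgrading, a quick sanity check along these lines — essentially the minimal repro above run on the GPU — should confirm that the wheel's CUDA kernels are now registered; treat it as a sketch rather than an official verification step.)

import torch
import fbgemm_gpu

# CUDA version the PyTorch wheel was built against (11.8 in this thread)
print(torch.__version__, torch.version.cuda)

# Minimal CUDA-side call from the repro above; it only succeeds if the
# fbgemm_gpu wheel actually registered its CUDA kernels.
permute = torch.tensor([0, 2, 1, 3], device="cuda", dtype=torch.int32)
lengths = torch.full((4, 4), 3, device="cuda", dtype=torch.int32)
values = torch.arange(int(lengths.sum()), device="cuda", dtype=torch.int64)
torch.ops.fbgemm.permute_2D_sparse_data(permute, lengths, values, None, int(lengths.sum()))
print("fbgemm CUDA ops are available")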