NotImplementedError: Could not run 'fbgemm::asynchronous_complete_cumsum' with arguments from the 'CUDA' backend
DeepakSaini119 opened this issue · comments
I am using torch nightly version 2.1.0.dev20230623+cu118 on Ubuntu 22.04.2. I have made sure that the CUDA version (nvcc --version) is 11.8.
I installed fbgemm-gpu-nightly successfully; pip reported: Successfully installed fbgemm-gpu-nightly-2023.6.23
However, a torchrec script that uses jagged tensor ops from fbgemm fails with the error NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. Any suggestions? TIA.
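For anyone comparing versions the same way, here is a minimal sketch of reading the CUDA build out of a torch version string such as the one in the report above. The helper name cuda_version_from_tag is hypothetical, and the sketch assumes the local version tag has the form +cuXYZ, where the last digit is the minor version.

```python
import re
from typing import Optional


def cuda_version_from_tag(version: str) -> Optional[str]:
    """Return e.g. '11.8' for '2.1.0.dev20230623+cu118'.

    Hypothetical helper: assumes a trailing '+cuXYZ' local version tag
    whose last digit is the CUDA minor version. Returns None otherwise
    (for example for a '+cpu' build or a version without a tag).
    """
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None
    digits = match.group(1)
    if len(digits) < 2:
        return None
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_version_from_tag("2.1.0.dev20230623+cu118"))  # 11.8
```

With torch installed, the same helper can be given torch.__version__, and the result compared against torch.version.cuda and the toolkit version that nvcc reports.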
Hi @DeepakSaini119, thanks for reaching out to us. Do you have a minimal example of the torchrec script, or minimal example code that uses fbgemm_gpu, for reproducing this error?
Please find a minimal code example below
import os
import argparse
import numpy as np
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import torchrec
from torchrec.modules.embedding_configs import DataType


def setup(rank, args):
    torch.cuda.set_device(rank)
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=args.world_size)
    return None


def train(rank, embedding_tables, args):
    print(f"setting up in {rank}")
    setup(rank, args)
    model = torchrec.distributed.DistributedModelParallel(embedding_tables, device=torch.device(rank))
    if(rank == 0):
        print(model)
        print(model.plan)
        print("named params")
        for name, param in model.named_parameters():
            print(name, param.shape, param)
    label_index = torch.LongTensor([0,1,2]).to(rank)
    minibatch_kjt = torchrec.KeyedJaggedTensor(keys=["free_parameters"],
                                               values=label_index,
                                               lengths=torch.ones(len(label_index), dtype=torch.int64).to(label_index.device))
    free_params = model(minibatch_kjt)["free_parameters"].values() # fails at this line with the said error
    print("free_params:", free_params.shape, free_params)


if(__name__ == "__main__"):
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    args.nodes = 1
    args.gpus = torch.cuda.device_count()
    args.world_size = args.gpus * args.nodes
    embedding_tables = torchrec.EmbeddingCollection(device="cpu",
        tables=[
            torchrec.EmbeddingConfig(
                name="free_parameters",
                embedding_dim=64,
                num_embeddings=10,
                data_type=DataType.FP32
            )
        ]
    )
    lbl_embs = np.array([np.ones((64, ), dtype=np.float32) * i for i in range(10)])
    print(lbl_embs) # is [[0, 0, 0, ...], [1, 1, 1, ...], [2, 2, 2, ...], ..., [9, 9, 9, ...]]
    with torch.no_grad():
        for name, param in embedding_tables.named_parameters():
            if name == 'embeddings.free_parameters.weight':
                print("Intializing table...")
                param.copy_(torch.tensor(lbl_embs))
    print(">>> Spawning processes")
    mp.spawn(train, nprocs=args.gpus, args=(embedding_tables, args,))
I have also tried CUDA version 11.6, but I get the same error.
@q10, adding the full stack trace in case it helps in any way
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/user/TripletClf/TestDDP.py", line 124, in train
free_params = model(minibatch_kjt)["free_parameters"].values()
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/TripletClf/./torchrec/torchrec/distributed/model_parallel.py", line 265, in forward
return self._dmp_wrapped_module(*args, **kwargs)
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/TripletClf/./torchrec/torchrec/distributed/types.py", line 698, in forward
dist_input = self.input_dist(ctx, *input, **kwargs).wait().wait()
File "/home/user/TripletClf/./torchrec/torchrec/distributed/embedding.py", line 681, in input_dist
awaitables.append(input_dist(features))
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/TripletClf/./torchrec/torchrec/distributed/sharding/rw_sharding.py", line 257, in forward
) = bucketize_kjt_before_all2all(
File "/home/user/TripletClf/./torchrec/torchrec/distributed/embedding_sharding.py", line 92, in bucketize_kjt_before_all2all
) = torch.ops.fbgemm.block_bucketize_sparse_features(
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/_ops.py", line 677, in __call__
return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::block_bucketize_sparse_features' is only available for these backends: [BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:498 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:290 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:30 [backend fallback]
AutogradCPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:34 [backend fallback]
AutogradCUDA: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:42 [backend fallback]
AutogradXLA: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:46 [backend fallback]
AutogradMPS: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:54 [backend fallback]
AutogradXPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:38 [backend fallback]
AutogradHPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
AutogradLazy: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:50 [backend fallback]
AutogradMeta: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:58 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:296 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:383 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:250 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:710 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:201 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:494 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at /__w/FBGEMM/FBGEMM/fbgemm_gpu/src/sparse_ops/sparse_ops_cpu.cpp:2588 [kernel]
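A note on reading the message above: the backend list contains AutogradCUDA and AutocastCUDA fallbacks but no plain CUDA entry, which would be consistent with the installed build not having registered a CUDA kernel for this op. The sketch below pulls the list out of such a message and checks for an exact key. The function names are hypothetical, and the sample string is an abbreviated copy of the message, not the full text.

```python
import re
from typing import List


def listed_backends(message: str) -> List[str]:
    """Extract the names inside 'only available for these backends: [...]'."""
    match = re.search(r"only available for these backends: \[([^\]]*)\]", message)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def has_backend_kernel(message: str, backend: str = "CUDA") -> bool:
    """True only if the exact key is listed ('AutogradCUDA' does not count)."""
    return backend in listed_backends(message)


# Abbreviated version of the error message from the trace above.
sample = (
    "Could not run 'fbgemm::block_bucketize_sparse_features' with arguments "
    "from the 'CUDA' backend. 'fbgemm::block_bucketize_sparse_features' is "
    "only available for these backends: [BackendSelect, Python, AutogradCPU, "
    "AutogradCUDA, AutocastCUDA, PythonDispatcher]."
)

print(has_backend_kernel(sample, "CUDA"))  # False
```

This only inspects the text of the exception; it does not query the dispatcher itself.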
We've been able to come up with a bare-minimum example (i.e. no torchrec involved) that produces an error with a similar signature:
import torch
import fbgemm_gpu
indices_tensor = torch.tensor([0, 2, 1, 3], device='cuda:0', dtype=torch.int32)
lengths = torch.tensor([[3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]], device='cuda:0', dtype=torch.int32)
values = torch.tensor([0, 3, 0, 2, 3, 2, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 2, 1, 3, 0, 0, 2, 2, 1, 3, 1, 1, 0, 1, 1, 3, 1, 0, 1, 1, 1, 3, 3], device='cuda:0')
permuted_lengths_sum = 12
torch.ops.fbgemm.permute_2D_sparse_data(
    indices_tensor,
    lengths,
    values,
    None,
    permuted_lengths_sum,
)
Looking further into this.
Hi @DeepakSaini119, the error doesn't persist on current nightly. If you could update to the current nightly, it should work.
Hi @spcyppt, the error still seems to persist for me with fbgemm-gpu-nightly-2023.7.26.
@DeepakSaini119 do you get any errors from running the code above? i.e.,
import torch
import fbgemm_gpu
indices_tensor = torch.tensor([0, 2, 1, 3], device='cuda:0', dtype=torch.int32)
lengths = torch.tensor([[3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]], device='cuda:0', dtype=torch.int32)
values = torch.tensor([0, 3, 0, 2, 3, 2, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 2, 1, 3, 0, 0, 2, 2, 1, 3, 1, 1, 0, 1, 1, 3, 1, 0, 1, 1, 1, 3, 3], device='cuda:0')
permuted_lengths_sum = 12
torch.ops.fbgemm.permute_2D_sparse_data(
    indices_tensor,
    lengths,
    values,
    None,
    permuted_lengths_sum,
)
Hi @spcyppt, the error is
Traceback (most recent call last):
File "/home/user/TripletClf/TestFBGEMM.py", line 7, in <module>
torch.ops.fbgemm.permute_2D_sparse_data(
File "/home/user/miniconda3/envs/torchrec/lib/python3.10/site-packages/torch/_ops.py", line 681, in __call__
return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::permute_2D_sparse_data' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::permute_2D_sparse_data' is only available for these backends: [BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
Hi @DeepakSaini119, thank you for your input. I will look into this further.
Hi @DeepakSaini119, the workaround for this is to build fbgemm_gpu from source on GitHub; please see https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/docs/BuildInstructions.md for instructions. We're looking into the issues with fbgemm-gpu-nightly from pip.
Closing issue since there is a workaround. @DeepakSaini119 please feel free to reopen this if you still run into the issue. Thanks
@spcyppt actually, can we keep this issue open, or open a separate issue for this problem? I want to track when this gets resolved.
Hi @DeepakSaini119, the issue was due to a CUDA version mismatch. The currently published fbgemm-gpu-nightly is now built for CUDA 11.8, so installing the latest fbgemm-gpu-nightly should fix the issue. Thank you.
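To confirm which builds are actually present after reinstalling, a small sketch like the following lists the installed versions using the standard library. The function name installed_versions is hypothetical, and the package list is an assumption based on the names mentioned in this thread; the sketch does not report which CUDA version a wheel was built for, only which distributions are installed.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(packages: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names assumed from this thread; adjust to your environment.
for name, version in installed_versions(
    ["torch", "fbgemm-gpu-nightly", "torchrec"]
).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

If both a nightly and a stable fbgemm_gpu distribution turn out to be installed side by side, that is worth cleaning up before testing again.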