gpu execution error

Question

twangnh opened this issue 2 years ago · comments

I used the nearest function (i.e. from torch_cluster import nearest). When the input is on the CPU the result is correct, but when the input is on the GPU the result is completely wrong, with very large returned indices. Could you please help? Thanks in advance!

Matthias Fey · Answer 1 · Wed Oct 05 2022 12:09:00 GMT+0800 (China Standard Time)

Do you have a minimal example to reproduce?

twang · Answer 2 · Wed Oct 05 2022 14:36:43 GMT+0800 (China Standard Time)

import torch
from torch_cluster import nearest

torch.manual_seed(12345)

A = torch.randn(100, 3).cuda()
B = torch.randn(100, 3).cuda()
inds = nearest(B, A)

inds will contain out-of-range indices that are very large.
I'm also interested in how much faster CUDA should be than the CPU.
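
One way to pin down the symptom is to check the returned indices against the valid range and against a brute-force reference. The sketch below is dependency-free and only illustrative: brute_force_nearest and out_of_range are hypothetical helper names, not part of torch_cluster, and it assumes that nearest(x, y) returns, for each row of x, the index of the closest row of y.

```python
import math


def brute_force_nearest(x, y):
    """For each point in x, return the index of the closest point in y."""
    result = []
    for p in x:
        # Linear scan over y; O(len(x) * len(y)), fine for a 100 x 100 check.
        best = min(range(len(y)), key=lambda j: math.dist(p, y[j]))
        result.append(best)
    return result


def out_of_range(indices, num_targets):
    """Return the positions whose index does not address a row of y."""
    return [i for i, j in enumerate(indices) if not 0 <= j < num_targets]


# Tiny hand-made example (not the data from the thread).
x = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
y = [(0.9, 1.0, 1.1), (0.1, 0.0, -0.1)]
reference = brute_force_nearest(x, y)
bad = out_of_range([1, 0, 4611686018427387904], len(y))
```

With tensors, the same helpers could be fed B.cpu().tolist(), A.cpu().tolist() and inds.cpu().tolist(); any non-empty out_of_range result on the GPU run would confirm the report.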

Matthias Fey · Answer 3 · Thu Oct 06 2022 03:33:23 GMT+0800 (China Standard Time)

That's interesting. The following works for me:

import torch
from torch_cluster import nearest

torch.manual_seed(12345)

A = torch.randn(100, 3)
B = torch.randn(100, 3)
inds1 = nearest(B, A)
inds2 = nearest(B.cuda(), A.cuda())
assert torch.equal(inds1, inds2.cpu())

May I ask how you installed torch-cluster on your system?

twang · Answer 4 · Thu Oct 06 2022 09:53:42 GMT+0800 (China Standard Time)

It is strange: the code works on the machine where I installed torch_cluster, but not on the other machines. We have a small cluster of machines that share the same home directory, where I installed the Anaconda environment, so every machine can access the same Python environment and packages installed from one machine are shared by all of them. All the other Python packages I installed before work this way.

I installed it with pip install torch-cluster -f https://data.pyg.org/whl/torch-1.8.1+cu101.html

twang · Answer 5 · Thu Oct 06 2022 10:04:54 GMT+0800 (China Standard Time)

I find that it also works on other machines with the same GPU (TITAN RTX), but not on older ones such as the GeForce GTX TITAN X. Could it be due to GPU architecture support settings during installation?
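
The architecture hypothesis can be sketched as a small compatibility check. CUDA binary code built for sm_XY runs only on devices with the same major version and a minor version of at least Y, while embedded PTX (compute_XY) can be JIT-compiled on any device of capability X.Y or newer. The function names below are hypothetical, and the compiled architecture list is an assumption for illustration; the thread does not say which architectures the installed build actually contained. The TITAN RTX has compute capability 7.5 and the GeForce GTX TITAN X has 5.2.

```python
def parse_arch(tag):
    """Split 'sm_75' or 'compute_75' into (kind, major, minor)."""
    kind, digits = tag.split("_")
    return kind, int(digits[:-1]), int(digits[-1])


def can_run(device_capability, compiled_archs):
    """Return True if a device can execute code built for compiled_archs."""
    dev_major, dev_minor = device_capability
    for tag in compiled_archs:
        kind, major, minor = parse_arch(tag)
        # Binary (SASS) code: same major version, device minor >= built minor.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX: can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


# Assumed architecture list, for illustration only.
assumed_archs = ["sm_60", "sm_61", "sm_70", "sm_75"]
titan_rtx_ok = can_run((7, 5), assumed_archs)
gtx_titan_x_ok = can_run((5, 2), assumed_archs)
```

On a real machine the device tuple can be obtained from torch.cuda.get_device_capability(), which returns (major, minor) for the current GPU.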

twang · Answer 6 · Thu Oct 06 2022 11:02:52 GMT+0800 (China Standard Time)

To try to support more GPU models, I cloned https://github.com/rusty1s/pytorch_cluster and added the following to setup.py:

        if suffix == 'cuda':
            define_macros += [('WITH_CUDA', None)]
            nvcc_flags = os.getenv('NVCC_FLAGS', '')
            nvcc_flags = [] if nvcc_flags == '' else nvcc_flags.split(' ')
            nvcc_flags += ['--expt-relaxed-constexpr', '-O2']
            nvcc_flags += ["-arch=sm_60",
                "-gencode=arch=compute_60,code=sm_60",
                "-gencode=arch=compute_61,code=sm_61",
                "-gencode=arch=compute_70,code=sm_70",
                "-gencode=arch=compute_75,code=sm_75",]
            extra_compile_args['nvcc'] = nvcc_flags

I then installed from source with:
pip install -e pytorch_cluster
It installed successfully; however, importing torch_cluster raises:

>>> from torch_cluster import nearest
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wangtao/prj/pytorch_cluster/torch_cluster/__init__.py", line 45, in <module>
    from .rw import random_walk  # noqa
  File "/home/wangtao/prj/pytorch_cluster/torch_cluster/rw.py", line 8, in <module>
    def random_walk(
  File "/home/wangtao/anaconda3_2/envs/deform_seg_env/lib/python3.8/site-packages/torch/jit/_script.py", line 989, in script
    fn = torch._C._jit_script_compile(
RuntimeError:
General Union types are not currently supported. Only Union[T, NoneType] (i.e. Optional[T]) is supported.:
  File "/home/wangtao/prj/pytorch_cluster/torch_cluster/rw.py", line 18
    num_nodes: Optional[int] = None,
    return_edge_indices: bool = False,
) -> Union[Tensor, Tuple[Tensor, Tensor]]:
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    """Samples random walks of length :obj:`walk_length` from all node indices
    in :obj:`start` in the graph given by :obj:`(row, col)` as described in the
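
The import error above is raised by TorchScript rejecting the Union return annotation in rw.py, so it looks independent of the nvcc flags; it suggests the cloned source expects a newer PyTorch than the one installed. Separately, the flags added to setup.py cover capabilities 6.0 through 7.5, whereas the GeForce GTX TITAN X is a compute capability 5.2 device and would need a compute_52 entry. A minimal sketch of generating such flags from a list of capabilities follows; gencode_flags is a hypothetical helper, not something the repository provides.

```python
def _digits(capability):
    """Turn '7.5' into '75'."""
    major, minor = capability.split(".")
    return f"{int(major)}{int(minor)}"


def gencode_flags(capabilities, ptx_for_newest=True):
    """Build nvcc -gencode flags for each capability, e.g. '5.2' or '7.5'."""
    flags = [
        f"-gencode=arch=compute_{_digits(c)},code=sm_{_digits(c)}"
        for c in capabilities
    ]
    if ptx_for_newest and capabilities:
        # Also embed PTX for the newest target so that later GPUs can
        # JIT-compile the kernels.
        newest = max(capabilities,
                     key=lambda c: tuple(int(p) for p in c.split(".")))
        flags.append(
            f"-gencode=arch=compute_{_digits(newest)},"
            f"code=compute_{_digits(newest)}"
        )
    return flags


flags = gencode_flags(["5.2", "6.1", "7.5"])
```

PyTorch's extension builder also reads the TORCH_CUDA_ARCH_LIST environment variable (for example "5.2 6.1 7.5+PTX"), which may be a simpler route than editing setup.py by hand.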

Matthias Fey · Answer 7 · Thu Oct 06 2022 20:28:21 GMT+0800 (China Standard Time)

Can you try to install via pip install --no-index torch-cluster -f https://data.pyg.org/whl/torch-1.8.1+cu101.html (note the --no-index, which is needed for older PyTorch versions)? I guess you are currently building from source due to the old PyTorch version. I am confident the pre-built wheels support a variety of architectures.
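
The wheel index in the command above encodes the PyTorch version and the CUDA tag. As a rough sketch, the command can be assembled from those two values; the URL pattern is inferred from the single URL quoted in this thread, and both function names are hypothetical.

```python
def pyg_wheel_index(torch_version, cuda_tag):
    """Build the wheel index URL, e.g. for torch 1.8.1 and cu101."""
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cuda_tag}.html"


def pip_install_command(package, torch_version, cuda_tag):
    """Return the pip argument list that installs only from the wheel index."""
    return [
        "pip", "install",
        "--no-index",  # skip PyPI so pip cannot fall back to a source build
        package,
        "-f", pyg_wheel_index(torch_version, cuda_tag),
    ]


command = " ".join(pip_install_command("torch-cluster", "1.8.1", "cu101"))
```

The two values should match the installed PyTorch build; torch.__version__ and torch.version.cuda are the usual places to read them from.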

twang · Answer 8 · Thu Oct 13 2022 10:40:19 GMT+0800 (China Standard Time)

Sorry for the delayed response. It works with pip install --no-index torch-cluster -f https://data.pyg.org/whl/torch-1.8.1+cu101.html. Thanks for your help!

github-actions · Answer 9 · Wed Apr 12 2023 09:00:00 GMT+0800 (China Standard Time)

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?