Training and testing both fail with RuntimeError: CUDA error: invalid device function

Heroism502 opened this issue 3 years ago · comments

Hello, both training and testing fail with the error below. Is this caused by my environment configuration?

EMD-CD: 0%| | 0/19 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_ae.py", line 203, in <module>
    cd_loss = validate_loss(it)
  File "train_ae.py", line 169, in validate_loss
    metrics = EMD_CD(all_recons, all_refs, batch_size=args.val_batch_size, accelerated_cd=True)
  File "/userhome/point_cloud/diffusion-point-cloud-main/evaluation/evaluation_metrics.py", line 58, in EMD_CD
    cd_lst.append(dl.mean(dim=1) + dr.mean(dim=1))
RuntimeError: CUDA error: invalid device function
Segmentation fault (core dumped)

Shitong Luo · Answer 1 · Mon Jul 12 2021 16:24:33 GMT+0800 (China Standard Time)

What's your GPU model?

LXie · Answer 2 · Mon Jul 12 2021 16:27:10 GMT+0800 (China Standard Time)

> What's your GPU model?

RTX2080

Div Garg · Answer 3 · Thu Jul 15 2021 08:21:00 GMT+0800 (China Standard Time)

Can confirm getting the same error running on RTX2080

Shitong Luo · Answer 4 · Thu Jul 15 2021 08:22:19 GMT+0800 (China Standard Time)

I will release a version that doesn't require compiling CUDA extensions soon.

WN1695173791 · Answer 5 · Sat Jul 17 2021 13:03:21 GMT+0800 (China Standard Time)

Does anyone know how to solve this problem?

InfyIT-fa · Answer 6 · Sat Aug 07 2021 00:45:11 GMT+0800 (China Standard Time)

Hi, I have a quick question: were you able to build StructuralLossesBackend? If so, could you please let me know how? I am getting a g++ error (I have already opened an issue, but there has been no response yet).

Shitong Luo · Answer 7 · Sat Aug 07 2021 00:47:43 GMT+0800 (China Standard Time)

Hi all,

The EMD_CD functions are for validation purposes only. Training doesn't rely on them, so you may remove that part of the code from the training script. I will also release a version without them later.
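As a rough illustration of the suggestion above, the validation call can be put behind a switch instead of being deleted outright. This is only a sketch: `run_validation`, its `enabled` argument, and `fake_validate_loss` are hypothetical names introduced here, and `fake_validate_loss` merely stands in for the `validate_loss(it)` call seen in the traceback.

```python
def run_validation(validate_fn, it, enabled):
    """Call validate_fn(it) only when validation is enabled.

    Returns the validation loss, or None when validation is skipped.
    """
    if not enabled:
        return None
    return validate_fn(it)


def fake_validate_loss(it):
    """Hypothetical stand-in for the script's validate_loss(it)."""
    return 0.5


# With validation disabled, the EMD_CD code path is never reached.
skipped = run_validation(fake_validate_loss, it=1000, enabled=False)
computed = run_validation(fake_validate_loss, it=1000, enabled=True)
print(skipped, computed)
```

Any code that consumes the returned loss afterwards would need to cope with `None` when validation is skipped.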

HailongZhou · Answer 8 · Wed Sep 29 2021 17:23:32 GMT+0800 (China Standard Time)

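> Does anyone know how to solve this problem?

This error occurs because the library under the evaluation directory that has to be compiled was not built correctly. The Makefile has a parameter, CUDA_ARCH, which sets the GPU compute capability. Different GPUs have different compute capabilities; you can look yours up on the NVIDIA website (https://developer.nvidia.com/zh-cn/cuda-gpus#compute) and then set the value that matches your GPU. You can also take compatibility into account so that the build is less likely to break; for the exact settings, see: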
https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
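One way to find the value without looking it up by hand is to ask PyTorch for the compute capability and turn it into an nvcc `-gencode` flag. The sketch below assumes CUDA_ARCH accepts a flag in that form, which should be checked against the Makefile; `gencode_flag` and `detect_capability` are hypothetical helper names. An RTX 2080 has compute capability 7.5.

```python
def gencode_flag(major, minor):
    """Build an nvcc -gencode flag for one compute capability, e.g. (7, 5)."""
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


def detect_capability(device_index=0):
    """Return (major, minor) for a visible GPU, or None if unavailable."""
    try:
        import torch  # imported lazily so gencode_flag works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch; look up the capability manually.")
    else:
        # Assumption: the Makefile's CUDA_ARCH takes a flag of this form.
        print(gencode_flag(*capability))
```

For an RTX 2080, `gencode_flag(7, 5)` gives `-gencode arch=compute_75,code=sm_75`. The extension has to be rebuilt after the Makefile is changed.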

Shitong Luo · Answer 9 · Thu Oct 07 2021 20:16:52 GMT+0800 (China Standard Time)

Hi all,

I have uploaded a version that doesn't require CUDA extensions. It depends ONLY on native PyTorch operations.
This version also fixes the multi-processing bug in the DataLoader.
You may try this new version.
Sorry for the late update.

Thanks!

Min Zhang · Answer 10 · Tue Apr 26 2022 08:54:13 GMT+0800 (China Standard Time)

Hi,

My compilation on CUDA 10.1 succeeded, and the invalid device function error disappeared. I guess the error is caused by a mismatch between CUDA 10.0 and 10.1, so you can try compiling the metrics with CUDA 10.1.
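A quick way to check for this kind of mismatch is to compare the CUDA version PyTorch was built with against the release reported by the local `nvcc`. This sketch assumes the mismatch in question is between those two, which the post does not state; `parse_nvcc_release`, `local_nvcc_release`, and `versions_match` are hypothetical helper names.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def local_nvcc_release():
    """Return the release of the nvcc on PATH, or None if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


def versions_match(torch_cuda, nvcc_release):
    """True only when both versions are known and identical."""
    return torch_cuda is not None and torch_cuda == nvcc_release


sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(parse_nvcc_release(sample))
print(versions_match("10.0", parse_nvcc_release(sample)))
```

In a real environment, pass `torch.version.cuda` as the first argument and `local_nvcc_release()` as the second.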