luost26 / diffusion-point-cloud

Diffusion Probabilistic Models for 3D Point Cloud Generation (CVPR 2021)

Training and testing both fail with RuntimeError: CUDA error: invalid device function

Heroism502 opened this issue

Hello, both training and testing fail with the error below. Could this be caused by my environment configuration?

EMD-CD: 0%| | 0/19 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_ae.py", line 203, in <module>
cd_loss = validate_loss(it)
File "train_ae.py", line 169, in validate_loss
metrics = EMD_CD(all_recons, all_refs, batch_size=args.val_batch_size, accelerated_cd=True)
File "/userhome/point_cloud/diffusion-point-cloud-main/evaluation/evaluation_metrics.py", line 58, in EMD_CD
cd_lst.append(dl.mean(dim=1) + dr.mean(dim=1))
RuntimeError: CUDA error: invalid device function
Segmentation fault (core dumped)

What's your GPU model?

RTX2080

Can confirm getting the same error running on RTX2080

I will release a version that doesn't require compiling CUDA extensions soon.

Does anyone know how to solve this problem?

Hi, I have a quick question: were you able to build StructuralLossesBackend? If so, could you please tell me how? I am getting a g++ error (I have already opened an issue, but there has been no response yet).

Hi all,

The EMD_CD functions are for validation purposes only. Training doesn't rely on them, so you may remove this part of the code from the training script. I will also release a version without them later.
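
Rather than deleting the validation code outright, one option is to gate it behind a command-line switch. The sketch below is only an illustration of that idea: the `--skip_validation` flag and the `should_validate` helper are hypothetical and are not part of the actual `train_ae.py`, and `--val_freq` is assumed here as the name of the validation interval option.

```python
import argparse


def build_parser():
    """Argument parser with a switch for disabling validation."""
    parser = argparse.ArgumentParser()
    # Assumed name for the validation interval option.
    parser.add_argument("--val_freq", type=int, default=1000)
    # Hypothetical flag: not an option of the real training script.
    parser.add_argument("--skip_validation", action="store_true")
    return parser


def should_validate(it, args):
    """Return True when validation should run at iteration `it`."""
    if args.skip_validation:
        # The compiled EMD/CD metrics are never called in this mode.
        return False
    return it % args.val_freq == 0
```

In a training loop, a call such as `validate_loss(it)` would then sit behind `if should_validate(it, args):`, so the compiled metrics are never touched when the flag is set.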

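This error occurs because the library under the evaluation directory that has to be compiled was not built correctly. The Makefile has a parameter, CUDA_ARCH, which sets the GPU's compute capability. Different GPUs have different compute capabilities, which you can look up on NVIDIA's site: https://developer.nvidia.com/zh-cn/cuda-gpus#compute. Then set the value that matches your own GPU. You may also want to take compatibility into account so that the build is less likely to crash; for the specific settings, see: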
https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
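
As a rough illustration of the advice above, the following sketch turns a compute capability into standard nvcc `-gencode` flags. The helper names `gencode_flags` and `detect_capability` are hypothetical, and the exact form that `CUDA_ARCH` expects in this repository's Makefile is not shown in the thread, so treat the output as a starting point. The RTX 2080 reported above has compute capability 7.5.

```python
def gencode_flags(capabilities):
    """Build nvcc -gencode flags for a list of (major, minor) capabilities.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    flags = []
    for major, minor in capabilities:
        arch = f"{major}{minor}"
        # Machine code (SASS) for exactly this architecture.
        flags.append(f"-gencode arch=compute_{arch},code=sm_{arch}")
    if capabilities:
        # Also embed PTX for the newest listed architecture so that
        # later GPUs can JIT-compile the kernels.
        major, minor = max(capabilities)
        arch = f"{major}{minor}"
        flags.append(f"-gencode arch=compute_{arch},code=compute_{arch}")
    return " ".join(flags)


def detect_capability():
    """Return the (major, minor) capability of GPU 0, or None if unknown."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    # Fall back to the RTX 2080 value when no GPU can be queried.
    cap = detect_capability() or (7, 5)
    print(gencode_flags([cap]))
```

Listing several capabilities, for example `gencode_flags([(6, 1), (7, 5)])`, produces one binary that covers more than one GPU generation, which is the compatibility trade-off mentioned above.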

Hi all,

I have uploaded a version that doesn't require CUDA extensions. It depends ONLY on native PyTorch operations.
The version also fixes the multi-processing bug in Dataloader.
You may try this new version.
Sorry for the late update.

Thanks!

Hi,

My compilation with CUDA 10.1 succeeded, and the invalid device function error disappeared. I guess the error was caused by a mismatch between CUDA 10.0 and 10.1, so you can try compiling the metrics with CUDA 10.1.
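
One way to check for this kind of mismatch before compiling is to compare the toolkit release reported by `nvcc --version` with the CUDA version PyTorch was built against. The sketch below assumes that these are the two versions that disagreed, which the thread does not confirm; `parse_nvcc_release` and `toolkit_matches` are hypothetical helper names.

```python
import re
import subprocess


def parse_nvcc_release(version_text):
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", version_text)
    return match.group(1) if match else None


def toolkit_matches(nvcc_release, torch_cuda):
    """True only when both versions are known and agree on major.minor."""
    if nvcc_release is None or torch_cuda is None:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        )
        nvcc_release = parse_nvcc_release(result.stdout)
    except FileNotFoundError:
        nvcc_release = None  # nvcc is not on PATH

    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None

    print("nvcc release:", nvcc_release)
    print("PyTorch CUDA:", torch_cuda)
    print("match:", toolkit_matches(nvcc_release, torch_cuda))
```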