problem with spconv
muzi2045 opened this issue · comments
Hi, I'm using PyTorch 1.4 + CUDA 10.0 + spconv 1.1 when training on the Nuscenes dataset.
But there is a bug in the sparse conv layer when using the spconv lib.
Have you encountered this problem with spconv 1.0? (It looks like the error only happens sometimes.)
I'm hoping for a solution that works with the newest dependency versions.
Refer to traveller59/spconv#74
Hi, I have read about this problem but I have never encountered it (my spconv fork is v1.0, based on commit 7342772). Have you noticed any speed gain with spconv v1.1? I have read that the cuhash implementation in v1.1 results in much higher memory consumption. I will test this to make sure, because I would prefer to use v1.1 as long as it has no memory problems.
If you want to use v1.1, maybe you can try the fix `table_size_ = unsigned(floor(max_table_entries * space_usage) + 1)` mentioned in the thread you linked?
Yes, I'm testing the fixed code. Hoping for no crashes during training....
I am testing spconv 1.0 from the fork https://github.com/jhultman/spconv
and the newest spconv from https://github.com/traveller59/spconv while training on the same Nuscenes dataset.
Here are some interesting findings:
When using spconv 1.1, the unknown bug in hash_table.cpp sometimes leads to a crash, and the fix in hash_table.cpp doesn't help; but training with batch_size=2 is OK.
When switching to spconv 1.0, batch_size=2 leads to a CUDA out-of-memory error.
With batch_size=1, the GPU memory consumption on a single 2080 Ti -->
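For anyone reproducing this comparison, a minimal way to record peak GPU memory per training step in PyTorch (a generic sketch, not tied to any particular spconv training script; `log_gpu_memory` is a hypothetical helper name):

```python
import torch

def log_gpu_memory(tag: str) -> float:
    """Print and return peak GPU memory (MiB) since the last reset.

    Returns 0.0 on CPU-only machines so the helper is safe everywhere.
    """
    if not torch.cuda.is_available():
        return 0.0
    peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"[{tag}] peak GPU memory: {peak_mib:.1f} MiB")
    # Reset so the next call reports the peak of the next step only.
    torch.cuda.reset_peak_memory_stats()
    return peak_mib

# Usage inside a training loop (sketch):
# for step, batch in enumerate(loader):
#     loss = model(batch)
#     loss.backward()
#     optimizer.step()
#     log_gpu_memory(f"step {step}")
```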
My testing environment:
PyTorch 1.4
spconv 1.0
CUDA 10.0
Python 3.6.8
Ubuntu 16.04
@jhultman