issue with classification

Question

rabbiahassan opened this issue 3 years ago

Hello!
Your work is very interesting. When I start training the classification model, it doesn't show any error, but it gets stuck here and doesn't proceed. Could you please tell me what the issue is?
Thanks for your time.

Mutian Xu (Mino) · Answer 1 · Mon Oct 25 2021 00:16:16 GMT+0800 (China Standard Time)

Hi,

Since our cuda_kernel uses a multi-threaded parallel strategy, 2 GPUs alone may not be enough to provide the parallel threads it needs.

To solve this issue, you can try running it on more GPUs,
or try a smaller batch_size (this will probably cause a performance drop because the batch_size is no longer the best setting, but it needs fewer threads).

emenent_CS · Answer 2 · Mon Oct 25 2021 01:11:18 GMT+0800 (China Standard Time)

Thanks for the response.
I reduced the batch size all the way to the minimum, but it still doesn't work. I don't think this issue is related to the batch size or to heavy computation. I am attaching the GPU memory status as well.
I think it gets stuck somewhere but doesn't show any error.

Mutian Xu (Mino) · Answer 3 · Mon Oct 25 2021 13:32:10 GMT+0800 (China Standard Time)

OK, if this is the first time you are running the classification code, please wait for some time (about 1-2 minutes, depending on the hardware) while the CUDA op compiles.

Also, if it gets stuck again at loss.backward after compiling has finished, the cause is the limited number of threads; please solve it by reducing the batch_size or using more GPUs.
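
To check whether the run is really stuck at loss.backward or somewhere earlier, one option is to have Python print a stack trace once the process has been silent for too long. Below is a minimal sketch using the standard-library faulthandler module. The helper names watch_for_hang and stop_watching and the 300-second timeout are illustrative assumptions, not part of this repository.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=300.0, repeat=True, file=None):
    """Dump the stack of every Python thread if the process is still
    running after timeout_s seconds (and every timeout_s seconds after
    that when repeat is True). The process is not killed."""
    if file is None:
        file = sys.stderr  # must be a real file with a file descriptor
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=file)


def stop_watching():
    """Cancel the pending dump, e.g. once training is clearly progressing."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage near the top of a training script:
#   watch_for_hang(timeout_s=300.0)
#   ... build the model, run the training loop ...
#   stop_watching()
```

The dump only shows Python frames, so a hang inside a compiled CUDA op appears as the Python line that called it (for example the line with loss.backward), which is still enough to tell the compile step, the forward pass and the backward pass apart.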

emenent_CS · Answer 4 · Mon Oct 25 2021 17:38:15 GMT+0800 (China Standard Time)

Thanks for your response again.
I have reduced the batch size all the way down to 4, but it still doesn't work.
I am attaching a screenshot of the GPU usage as well. I think it gets stuck even earlier, because it isn't even using the GPU to its full capacity.

Mutian Xu (Mino) · Answer 5 · Mon Oct 25 2021 17:49:26 GMT+0800 (China Standard Time)

Does it stay stuck? Have you waited for more than 2 minutes?

emenent_CS · Answer 6 · Mon Oct 25 2021 17:50:53 GMT+0800 (China Standard Time)

Yes, it does. I have waited for five hours. It just doesn't move forward an inch.

Mutian Xu (Mino) · Answer 7 · Mon Oct 25 2021 18:32:43 GMT+0800 (China Standard Time)

OK, I have just run the code, and the program runs normally at normal speed with the original batch_size on either 4 3090Ti GPUs or 4 2080Ti GPUs.

From your picture, I can confirm that the code is OK and that you have compiled the CUDA lib.

So, as I mentioned before, this is caused by the very limited number of threads provided by your GPUs (this depends not only on the number of GPUs but also on their type).

What you can do now is run on more GPUs or better GPUs to support our cuda_kernel.
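
When comparing setups, it may help to post exactly which GPUs the training process can see. The sketch below assumes the code runs on PyTorch (loss.backward suggests this, but that is an assumption here) and returns an empty list when PyTorch or CUDA is not available. The function name describe_gpus is hypothetical.

```python
def describe_gpus():
    """Return one dict per visible CUDA device, or [] if PyTorch or
    CUDA is unavailable."""
    try:
        import torch  # assumption: the training code is PyTorch-based
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []

    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        gpus.append({
            "index": index,
            "name": props.name,
            # number of streaming multiprocessors on the device
            "sm_count": props.multi_processor_count,
            "total_memory_gib": round(props.total_memory / 2**30, 1),
        })
    return gpus


# Print one line per device so the output can be pasted into a comment.
for gpu in describe_gpus():
    print(gpu)
```

This only reports the hardware. It does not show how many threads the cuda_kernel needs, so it is a way to compare machines rather than a test of whether a given GPU is sufficient.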

meiqing0417 · Answer 8 · Sun Jan 02 2022 17:36:18 GMT+0800 (China Standard Time)

Excuse me, I also encountered this problem. Is it solved now?

brunotecgraf · Answer 9 · Wed Jan 19 2022 23:28:01 GMT+0800 (China Standard Time)

For me, using the pointnet option worked!