XgDuan / WSDEC

Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.

Segmentation fault (core dumped) even with Cuda-9.0

YapengTian opened this issue · comments

Thanks for sharing the code!

I ran the training code with CUDA 9.0 under pytorch-0.3.1-cuda90, but I still hit the bug. Can you tell me which part of the code causes it? I would like to try to fix it.

Thanks.
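
For reference, a quick sanity check (plain PyTorch calls, nothing repo-specific) to confirm which CUDA and cuDNN build the installed PyTorch actually reports:

import torch

print(torch.__version__)               # expected: 0.3.1
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version in use
print(torch.cuda.is_available())       # True if a GPU is visible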

I am very sorry to hear that about my buggy code. I lost my temper over this for a long time when running the code with CUDA 8.0: the bug does not occur every time; sometimes the code runs fine, sometimes it crashes, and worse, I could not pinpoint where it comes from (I rewrote or commented out several parts I suspected, but the bug persisted). However, the code ran fine after I switched to CUDA 9.0, so I assumed it was a CUDA 8.0 bug and gave up debugging. So I honestly do not know where the bug comes from.

But I think it is my responsibility to fix the bug. Could you share the exact command you used and the logs from when the code crashed? Then we can work together to fix it.

Error information:
[wsdec(0), 20]: train: epoch[00006], batch[0260/0312], elapsed time=2.8320s, loss: 41.171696, 0.000101
Segmentation fault (core dumped)

Cuda version: CUDA Version 9.0.176

It always core dumps at epoch 6 when running "python train_script/train_final.py --checkpoint_cg runs/test/model/test_00076_01-22-20-11-37.ckp --alias wsdec", but runs without issue during pre-training.

Yes, exactly the same situation: the error only occurs when training the final model. (I believe it appears in the first few epochs after switching from train_sl to train_cg because of the loaded Java METEOR scoring package, possibly due to its subprocess terminating?)
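
If that subprocess theory is right, one way to test it is to make sure the Java process is shut down explicitly instead of being killed along with the parent. A minimal sketch, assuming the evaluator wraps the METEOR jar with subprocess.Popen; the class name and jar path below are hypothetical, not the repo's actual code:

import atexit
import subprocess

class MeteorProcess(object):
    # Hypothetical wrapper; the real evaluator in this repo may differ.
    def __init__(self, jar_path='meteor-1.5.jar'):
        # METEOR is driven over stdin/stdout in -stdio mode.
        self.proc = subprocess.Popen(
            ['java', '-jar', jar_path, '-', '-', '-stdio', '-l', 'en', '-norm'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        atexit.register(self.close)  # always shut the JVM down cleanly

    def close(self):
        if self.proc.poll() is None:  # still running
            self.proc.stdin.close()
            self.proc.terminate()
            self.proc.wait()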

When using CUDA 8.0, I tried another pretrained model, which sometimes ran fine. Have you tried a different pretrained model? You could also try our released pretrained model, which reproduces exactly the results in the paper and does not trigger the bug (at least in our case).
By the way, what is your current score? I have noticed that the results are sometimes very good in the first few epochs.

Hi, are you still there?

Hi!

Thanks for the response! Since the bug ties up a GPU, I do not have a free GPU to try it on at the moment. I will be working on the CVPR rebuttal for the next two weeks and may explore the code further after that. Thanks!

Hi, thank you for your wonderful work!

I am also experiencing this exact same bug. It always crashes with a segfault at epoch 6, batch 260 using CUDA 9.0.176. Here are the current scores for the latest checkpoint:

Average across all tIoUs
--------------------------------------------------------------------------------
| CIDEr: 17.6971
| Bleu_4: 1.0493
| Bleu_3: 2.2116
| Bleu_2: 4.9127
| Bleu_1: 11.7354
| Precision: 60.4421
| ROUGE_L: 12.2123
| METEOR: 6.0898
| Recall: 33.7766

Let me know if any progress can be made in fixing this bug. I would like to train the model further!

I have not made any progress on this yet. I may return to the project next month and will let you know if anything changes. By the way, I compared your scores with the ones in my paper, and your model looks reasonably well trained. In my experiments the best scores were always obtained in the first few epochs, so it is worth checking the results from those early epochs.

And thanks for your comment; please keep in touch under this issue.

Hi, I wanted to let you know that training proceeded past epoch 6 after I decreased the batch size from the default of 32 to 16. An obvious fix in retrospect!
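
For anyone curious why that helps: a smaller batch cuts peak GPU memory per iteration, which seems to be what was tipping things over. A tiny illustration with a plain PyTorch DataLoader (the dummy tensors below are just stand-ins, not the repo's dataset class):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy features standing in for the real video data (illustrative only).
features = torch.randn(512, 500, 128)   # (num_samples, timesteps, feature_dim)
labels = torch.zeros(512).long()
dataset = TensorDataset(features, labels)

# Halving the batch size (32 -> 16) roughly halves per-iteration activation
# memory, which is often enough to avoid memory-related crashes on small GPUs.
loader = DataLoader(dataset, batch_size=16, shuffle=True)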

@aemrey, see, this really is a strange bug.