XgDuan / WSDEC

Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.

Segmentation fault (core dumped) even with Cuda-9.0

YapengTian opened this issue · comments

Thanks for sharing the code!

I ran the training code with CUDA 9.0 under pytorch-0.3.1-cuda90, but I still hit the bug. Can you tell me which part of the code causes it? I would like to try to fix it.

Thanks.
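
For reference, a quick sanity check (plain PyTorch calls, nothing repo-specific) to confirm which CUDA and cuDNN build the installed PyTorch actually reports:

import torch

print(torch.__version__)               # expected: 0.3.1
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version in use
print(torch.cuda.is_available())       # True if a GPU is visible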

I am very sorry to hear that about my buggy code. I lost my temper over this for a long time when running the code with CUDA 8.0: the bug does not occur every time; sometimes the code runs fine, sometimes it crashes, and worse, I could not pinpoint where it comes from (I rewrote or commented out several parts I suspected, but the bug persisted). However, the code ran fine after I switched to CUDA 9.0, so I assumed it was a CUDA 8.0 bug and gave up debugging. So I honestly do not know where the bug comes from.

But I think it is my responsibility to fix the bug. Could you share the exact command you used and the logs from when the code crashed? Then we can work together to fix it.

Error information:
[wsdec(0), 20]: train: epoch[00006], batch[0260/0312], elapsed time=2.8320s, loss: 41.171696, 0.000101
Segmentation fault (core dumped)

Cuda version: CUDA Version 9.0.176

It always core dumps at epoch 6 when running "python train_script/train_final.py --checkpoint_cg runs/test/model/test_00076_01-22-20-11-37.ckp --alias wsdec", but runs without issue during pre-training.

Yes, exactly the same situation: the error only occurs when training the final model. (I believe it appears in the first few epochs after switching from train_sl to train_cg because of the loaded Java METEOR scoring package, possibly due to its subprocess terminating?)
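
If that subprocess theory is right, one way to test it is to make sure the Java process is shut down explicitly instead of being killed along with the parent. A minimal sketch, assuming the evaluator wraps the METEOR jar with subprocess.Popen; the class name and jar path below are hypothetical, not the repo's actual code:

import atexit
import subprocess

class MeteorProcess(object):
    # Hypothetical wrapper; the real evaluator in this repo may differ.
    def __init__(self, jar_path='meteor-1.5.jar'):
        # METEOR is driven over stdin/stdout in -stdio mode.
        self.proc = subprocess.Popen(
            ['java', '-jar', jar_path, '-', '-', '-stdio', '-l', 'en', '-norm'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        atexit.register(self.close)  # always shut the JVM down cleanly

    def close(self):
        if self.proc.poll() is None:  # still running
            self.proc.stdin.close()
            self.proc.terminate()
            self.proc.wait()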

When using CUDA 8.0, I tried another pretrained model, which sometimes ran fine. Have you tried a different pretrained model? You could also try our released pretrained model, which reproduces exactly the results in the paper and does not trigger the bug (at least in our case).
By the way, what is your current score? I have noticed that the results are sometimes very good in the first few epochs.

Hi, are you still there?

Hi!

Thanks for the response! Since the bug ties up a GPU, I do not have a free GPU to try it on at the moment. I will be working on the CVPR rebuttal for the next two weeks and may explore the code further after that. Thanks!

Hi, thank you for your wonderful work!

I am also experiencing this exact same bug. It always crashes with a segfault at epoch 6, batch 260 using CUDA 9.0.176. Here are the current scores for the latest checkpoint:

Average across all tIoUs
--------------------------------------------------------------------------------
| CIDEr: 17.6971
| Bleu_4: 1.0493
| Bleu_3: 2.2116
| Bleu_2: 4.9127
| Bleu_1: 11.7354
| Precision: 60.4421
| ROUGE_L: 12.2123
| METEOR: 6.0898
| Recall: 33.7766

Let me know if any progress can be made in fixing this bug. I would like to train the model further!

I have not made any progress on this yet. I may return to the project next month and will let you know if anything changes. By the way, I compared your scores with the ones in my paper, and your model looks reasonably well trained. In my experiments the best scores were always obtained in the first few epochs, so it is worth checking the results from those early epochs.

And thanks for your comment; please keep in touch under this issue.

Hi, I wanted to let you know that training proceeded past epoch 6 after I decreased the batch size from the default of 32 to 16. An obvious fix in retrospect!
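
For anyone curious why that helps: a smaller batch cuts peak GPU memory per iteration, which seems to be what was tipping things over. A tiny illustration with a plain PyTorch DataLoader (the dummy tensors below are just stand-ins, not the repo's dataset class):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy features standing in for the real video data (illustrative only).
features = torch.randn(512, 500, 128)   # (num_samples, timesteps, feature_dim)
labels = torch.zeros(512).long()
dataset = TensorDataset(features, labels)

# Halving the batch size (32 -> 16) roughly halves per-iteration activation
# memory, which is often enough to avoid memory-related crashes on small GPUs.
loader = DataLoader(dataset, batch_size=16, shuffle=True)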

@aemrey, see, this really is a strange bug.