SQuAD 2.0 cuda runtime error

Question

Theerit opened this issue 6 years ago

Hi all,
I am pretty new to PyTorch and deep learning. I am interested in your implementation and tried running your code for both SQuAD 2.0 and 1.1. I succeeded in running the 1.1 version, but the 2.0 version failed with the error below.

  File "train.py", line 168, in <module>
    main()
  File "train.py", line 111, in main
    results, labels = predict_squad(model, dev_data, v2_on=args.v2_on)
  File "/home/san_mrc/my_utils/data_utils.py", line 34, in predict_squad
    phrase, spans, scores = model.predict(batch)
  File "/home/san_mrc/src/model.py", line 112, in predict
    start, end, lab = self.network(batch)
  File "/home/anaconda2/envs/SAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/san_mrc/src/dreader.py", line 88, in forward
    doc_elmo, query_elmo = self.lexicon_encoder(batch)
  File "/home/anaconda2/envs/SAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/san_mrc/src/encoder.py", line 146, in forward
    doc_cove_low, doc_cove_high = self.ContextualEmbed(doc_tok, doc_mask)
  File "/home/anaconda2/envs/SAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/san_mrc/src/recurrent.py", line 140, in forward
    output1, _ = self.rnn1(pack(x_hiddens[indices], lens.tolist(), batch_first=True))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/generated/../THCReduceAll.cuh:317

The config options I changed are fix_embeddings and the number of epochs.
The error appeared after the first epoch of training had finished, during prediction on the dev set and before the model file was dumped/saved.
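
A possible way to narrow this down, offered only as a sketch: CUDA reports device-side asserts asynchronously, so the line shown in the traceback may not be the operation that actually failed. Setting the CUDA_LAUNCH_BLOCKING environment variable to 1 before CUDA is initialised makes kernels run synchronously, which usually gives a more accurate traceback. The snippet below also assumes (this is not confirmed by the traceback) that the assert comes from a token id falling outside the embedding table. The helper find_out_of_range and the example batch are hypothetical and not part of the repository; the nested lists stand in for something like doc_tok.tolist().

```python
import os

# Must be set before CUDA is initialised so that kernels run
# synchronously and the traceback points at the failing call.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range(token_ids, vocab_size):
    """Return (row, col, value) for every id outside [0, vocab_size).

    token_ids is a nested list of ints, one inner list per sequence.
    This is a hypothetical helper for illustration only.
    """
    bad = []
    for row, seq in enumerate(token_ids):
        for col, idx in enumerate(seq):
            if not 0 <= idx < vocab_size:
                bad.append((row, col, idx))
    return bad


# Hypothetical batch of token ids checked against a vocabulary of size 10.
batch = [[2, 5, 9], [1, 12, 0]]
print(find_out_of_range(batch, vocab_size=10))  # prints [(1, 1, 12)]
```

If a check like this reports any offending ids on the dev set, the vocabulary or embedding size used for SQuAD 2.0 would be the first thing to compare against the 1.1 setup. Running the same batch on CPU is another option, since index errors there are raised as ordinary Python exceptions.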