Can't train a language model
ofrimasad opened this issue · comments
Question
Hey, I am trying to train the language model onlplab/alephbert-base
(a Hebrew language model whose architecture is closest to RoBERTa).
But when I call trainer.train()
I get an error:
Traceback (most recent call last):
File ".../src/train/train.py", line 161, in <module>
question_answering(run_name=opt.run_name,
File ".../src/train/train.py", line 109, in question_answering
trainer.train()
File ".../lib/python3.8/site-packages/farm/train.py", line 300, in train
logits = self.model.forward(**batch)
File ".../lib/python3.8/site-packages/farm/modeling/adaptive_model.py", line 419, in forward
sequence_output, pooled_output = self.forward_lm(**kwargs)
File ".../lib/python3.8/site-packages/farm/modeling/adaptive_model.py", line 463, in forward_lm
sequence_output, pooled_output = self.language_model(**kwargs, return_dict=False, output_all_encoded_layers=False)
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/farm/modeling/language_model.py", line 679, in forward
output_tuple = self.model(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 815, in forward
encoder_outputs = self.encoder(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 508, in forward
layer_outputs = layer_module(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 395, in forward
self_attention_outputs = self.attention(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 323, in forward
self_outputs = self.self(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 187, in forward
mixed_query_layer = self.query(hidden_states)
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File ".../lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Any idea why this is happening?
My batch size is small (nowhere near filling the GPU memory).
I have also tried LanguageModel.load(lang_model, language_model_class='Roberta')
(since my model, like RoBERTa, does not use token_type_ids).
Thanks
Additional context
farm version 0.8.0
Hey, this error seems strange. I believe it is related to PyTorch or Hugging Face Transformers rather than a problem within FARM.
Have you tried running the code on CPU only to see whether it works there?
Actually, this post on the PyTorch forums says your CUDA device might be running out of memory, so you could try lowering the batch size or max_seq_len.
See https://discuss.pytorch.org/t/cuda-error-cublas-status-not-initialized-when-calling-cublascreate-handle/125450/2
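One quick way to act on the CPU suggestion (a generic sketch, not FARM-specific): hide all GPUs from PyTorch before it is first imported. On CPU, CUDA's asynchronous error reporting cannot mask the underlying exception, so the real failure usually surfaces as a readable Python error instead of an opaque cuBLAS message.

```python
import os

# Hide all CUDA devices so PyTorch falls back to CPU execution.
# This must run before torch is imported for the first time in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

With this set, `torch.cuda.is_available()` returns False and the training run executes on CPU, where the original exception (rather than a deferred CUDA error) is raised at the faulty operation.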
Hi @Timoeller.
I have actually managed to narrow this down.
My model expects batch['segment_ids']
to be all 0 (just like RoBERTa does).
When I use the deepset/roberta-base-squad2
model, this is exactly what happens,
but when using my model, batch['segment_ids']
contains 1s as well.
I can't find any way in the documentation to set all segment_ids to 0. I suspect your code checks whether the model is a RoBERTa at some point...
Thanks
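This diagnosis is consistent with the error: RoBERTa-style configs set type_vocab_size=1, so the token-type ("segment") embedding table has exactly one row, and any segment id of 1 indexes past it. A minimal pure-Python illustration of the failure mode (no GPU or torch needed; the table and lookup here are simplified stand-ins for the model's embedding layer):

```python
# RoBERTa-style models use type_vocab_size=1, i.e. a one-row token-type
# embedding table. A segment id of 1 is therefore out of range.
# On GPU this tends to surface as an opaque CUDA/cuBLAS error; on CPU it
# is a plain index error.
type_vocab_size = 1  # as in RoBERTa's config
token_type_embeddings = [[0.0] * 4 for _ in range(type_vocab_size)]

def lookup(segment_ids):
    """Look up one embedding row per segment id."""
    return [token_type_embeddings[i] for i in segment_ids]

print(len(lookup([0, 0, 0])))  # fine: every id fits the one-row table
try:
    lookup([0, 1, 0])          # segment id 1 exceeds the table size
except IndexError:
    print("segment id 1 is out of range for a table of size 1")
```

So a workaround would be to force every segment id in the batch to 0 before the forward pass, which is exactly what the RoBERTa code path does for its own models.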
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.