Can't train a language model
ofrimasad opened this issue · comments
Question
Hey, I am trying to train the language model onlplab/alephbert-base
(a Hebrew language model whose architecture is closest to RoBERTa).
But when I call trainer.train()
I get an error:
Traceback (most recent call last):
File ".../src/train/train.py", line 161, in <module>
question_answering(run_name=opt.run_name,
File ".../src/train/train.py", line 109, in question_answering
trainer.train()
File ".../lib/python3.8/site-packages/farm/train.py", line 300, in train
logits = self.model.forward(**batch)
File ".../lib/python3.8/site-packages/farm/modeling/adaptive_model.py", line 419, in forward
sequence_output, pooled_output = self.forward_lm(**kwargs)
File ".../lib/python3.8/site-packages/farm/modeling/adaptive_model.py", line 463, in forward_lm
sequence_output, pooled_output = self.language_model(**kwargs, return_dict=False, output_all_encoded_layers=False)
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/farm/modeling/language_model.py", line 679, in forward
output_tuple = self.model(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 815, in forward
encoder_outputs = self.encoder(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 508, in forward
layer_outputs = layer_module(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 395, in forward
self_attention_outputs = self.attention(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 323, in forward
self_outputs = self.self(
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 187, in forward
mixed_query_layer = self.query(hidden_states)
File ".../lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File ".../lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File ".../lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Any idea why this is happening?
My batch size is small (nowhere near filling the GPU memory).
I have also tried LanguageModel.load(lang_model, language_model_class='Roberta')
(since my model, like RoBERTa, does not use token_type_ids).
Thanks
Additional context
farm version 0.8.0
Hey, this error seems strange. I believe it is related to PyTorch or Hugging Face Transformers rather than a problem within FARM.
Have you tried running the code on CPU only to see whether it works there?
Actually, this post on the PyTorch forums says your CUDA device might be running out of memory, so you could try lowering the batch size or max_seq_len.
See https://discuss.pytorch.org/t/cuda-error-cublas-status-not-initialized-when-calling-cublascreate-handle/125450/2
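One quick way to act on the CPU suggestion (a generic sketch, not FARM-specific): hide all GPUs from PyTorch before it is first imported. On CPU, CUDA's asynchronous error reporting cannot mask the underlying exception, so the real failure usually surfaces as a readable Python error instead of an opaque cuBLAS message.

```python
import os

# Hide all CUDA devices so PyTorch falls back to CPU execution.
# This must run before torch is imported for the first time in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

With this set, `torch.cuda.is_available()` returns False and the training run executes on CPU, where the original exception (rather than a deferred CUDA error) is raised at the faulty operation.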
Hi @Timoeller.
I have actually managed to narrow this down.
My model expects batch['segment_ids']
to be all 0 (just like RoBERTa does).
When I use the deepset/roberta-base-squad2
model, this is exactly what happens,
but when using my model, batch['segment_ids']
contains 1s as well.
I can't find any way in the documentation to set all segment_ids to 0. I suspect your code checks whether the model is a RoBERTa at some point...
Thanks
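This diagnosis is consistent with the error: RoBERTa-style configs set type_vocab_size=1, so the token-type ("segment") embedding table has exactly one row, and any segment id of 1 indexes past it. A minimal pure-Python illustration of the failure mode (no GPU or torch needed; the table and lookup here are simplified stand-ins for the model's embedding layer):

```python
# RoBERTa-style models use type_vocab_size=1, i.e. a one-row token-type
# embedding table. A segment id of 1 is therefore out of range.
# On GPU this tends to surface as an opaque CUDA/cuBLAS error; on CPU it
# is a plain index error.
type_vocab_size = 1  # as in RoBERTa's config
token_type_embeddings = [[0.0] * 4 for _ in range(type_vocab_size)]

def lookup(segment_ids):
    """Look up one embedding row per segment id."""
    return [token_type_embeddings[i] for i in segment_ids]

print(len(lookup([0, 0, 0])))  # fine: every id fits the one-row table
try:
    lookup([0, 1, 0])          # segment id 1 exceeds the table size
except IndexError:
    print("segment id 1 is out of range for a table of size 1")
```

So a workaround would be to force every segment id in the batch to 0 before the forward pass, which is exactly what the RoBERTa code path does for its own models.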
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.