microsoft / fastformers

FastFormers - highly efficient transformer models for NLU

trying to train roberta-large model for question-answering task

skiran252 opened this issue

❓ Questions & Help

I am trying to convert the RoBERTa-large model to FastFormers. After preprocessing, I hit the following error while the data files are being converted to features.

Details

truncate_sequences
    assert len(ids) > num_tokens_to_remove
AssertionError

What led to this error?

Hi. What data set are you using? And, are you using --do_lower_case?

Yes, I was using that.

Here is part of the traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 142, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1521, in encode_plus
    **kwargs,
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 372, in _encode_plus
    verbose=verbose,
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 578, in _prepare_for_model
    stride=stride,
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 675, in truncate_sequences
    assert len(ids) > num_tokens_to_remove
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_squad.py", line 827, in <module>
    main()
  File "run_squad.py", line 765, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
  File "run_squad.py", line 459, in load_and_cache_examples
    threads=args.threads,
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 331, in squad_convert_examples_to_features
    disable=not tqdm_enabled,
  File "/home/saikiran/fastformers/venv/lib/python3.7/site-packages/tqdm/std.py", line 1171, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 325, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
AssertionError
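
For readers hitting the same assertion: the check at the bottom of the worker traceback, `assert len(ids) > num_tokens_to_remove`, appears to fail when the sequence chosen for truncation is no longer than the number of tokens that have to be removed. The sketch below is a simplified stand-in written for illustration, not the transformers implementation. The helper names `truncate_first` and `tokens_to_remove`, the 384-token budget, and the count of 4 special tokens are all assumptions.

```python
def truncate_first(ids, num_tokens_to_remove):
    """Hypothetical stand-in for the failing check in the traceback.

    Drops tokens from the end of `ids` and refuses to remove the
    whole sequence, mirroring the assertion shown above.
    """
    if num_tokens_to_remove <= 0:
        return list(ids)
    assert len(ids) > num_tokens_to_remove
    return list(ids[:-num_tokens_to_remove])


def tokens_to_remove(first_len, second_len, num_special, max_len):
    """Number of tokens by which a sequence pair exceeds the length budget."""
    return max(0, first_len + second_len + num_special - max_len)


# Assumed numbers: a 10-token sequence paired with a 400-token one,
# 4 special tokens, and a 384-token budget. 30 tokens must be removed,
# but the sequence being truncated only has 10.
excess = tokens_to_remove(first_len=10, second_len=400, num_special=4, max_len=384)

try:
    truncate_first(list(range(10)), excess)
    failed = False
except AssertionError:
    failed = True

print(excess, failed)
```

If this reading is right, the examples to look for are the ones where one side of the pair is very short relative to the overflow, or where the tokenizer settings do not match the data the script expects.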

We have only implemented SuperGLUE data processing at the moment. MultiRC in SuperGLUE is a QA dataset. You may want to try it. Or, you can follow SuperGLUE data processing implementation for your dataset.

https://github.com/microsoft/fastformers/blob/main/src/transformers/data/processors/superglue.py#L453
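
As a rough illustration of what "following the SuperGLUE data processing" could look like for a QA dataset, the sketch below flattens one MultiRC-style record into passage/question-answer pairs with a binary label. It is not the processor from the linked `superglue.py`. The `PairExample` class, the `flatten_multirc_record` function, and the record layout (a `passage` with nested `questions` and `answers`) are assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class PairExample:
    """Hypothetical container for one classification example."""
    guid: str
    text_a: str  # passage text
    text_b: str  # question followed by a candidate answer
    label: int   # 1 if the candidate answer is correct, else 0


def flatten_multirc_record(record: dict) -> List[PairExample]:
    """Turn one MultiRC-style record (assumed layout) into pair examples."""
    passage = record["passage"]
    examples = []
    for question in passage["questions"]:
        for answer in question["answers"]:
            guid = "{}-{}-{}".format(record["idx"], question["idx"], answer["idx"])
            text_b = "{} {}".format(question["question"], answer["text"])
            examples.append(
                PairExample(guid, passage["text"], text_b, int(answer["label"]))
            )
    return examples


# A made-up record in the assumed layout, serialized as one JSON line.
line = json.dumps({
    "idx": 0,
    "passage": {
        "text": "FastFormers speeds up transformer inference.",
        "questions": [{
            "idx": 0,
            "question": "What does FastFormers speed up?",
            "answers": [
                {"idx": 0, "text": "transformer inference", "label": 1},
                {"idx": 1, "text": "disk access", "label": 0},
            ],
        }],
    },
})

examples = flatten_multirc_record(json.loads(line))
for example in examples:
    print(example.guid, example.label, example.text_b)
```

An extractive dataset such as SQuAD would first need to be recast into this question-plus-candidate-answer form, which is a modeling choice rather than something the thread prescribes.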

No activity for 6+ months. Closing.