RuntimeError: Already borrowed

Question

indexxlim opened this issue 3 years ago

There is currently a bug when using the fast tokenizer.
When I run it from multiple threads, the error above occurs. Could you add an option such as use_fast = False so that the fast tokenizer is not used?

huggingface/tokenizers#537

Nikita Kitaev · Answer 1 · Wed Jun 02 2021 09:32:19 GMT+0800 (China Standard Time)

use_fast = False is not really a viable option, because the "slow" tokenizers don't implement return_offsets_mapping. Parsing operates over words, while pre-trained models use subwords with a bunch of Unicode substitution/normalization rules. The parser relies on having the tokenizer provide a mapping between subwords and character positions in the original string. "Slow" huggingface tokenizers don't implement this feature, and trying to reconstruct alignments after the fact is extremely error-prone due to all of the text normalization involved.

If you're using T5-based English parsers and want a solution just for yourself, you can probably modify the tokenization code to use the original sentencepiece library instead of huggingface. But I don't plan on adding such a solution to this repository, because it's not general-purpose and only works for a limited set of pre-trained models. You could also try hacking retokenization.py to have multiple tokenizer copies in thread-local storage.
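For anyone trying the thread-local route, here is a minimal sketch of the idea, not the actual code in retokenization.py. The `PerThreadTokenizer` wrapper and the `DummyTokenizer` stand-in are hypothetical names made up for illustration. In real use the factory would presumably build a fast tokenizer, for example with `transformers.AutoTokenizer.from_pretrained(model_name, use_fast=True)`, so that each thread gets its own copy and no two threads share one Rust-backed object.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadTokenizer:
    """Hypothetical helper: lazily builds one tokenizer per thread."""

    def __init__(self, factory):
        # factory is any zero-argument callable that returns a new tokenizer.
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Attributes on threading.local() are private to the calling thread,
        # so each thread creates its own tokenizer on first use and reuses it.
        tok = getattr(self._local, "tokenizer", None)
        if tok is None:
            tok = self._factory()
            self._local.tokenizer = tok
        return tok


class DummyTokenizer:
    """Stand-in so the sketch runs without downloading a model (assumption)."""

    def __call__(self, text):
        return text.split()


# In real use the factory might instead be something like:
#   lambda: transformers.AutoTokenizer.from_pretrained(model_name, use_fast=True)
per_thread = PerThreadTokenizer(DummyTokenizer)


def tokenize(text):
    # Each worker thread only ever touches its own tokenizer instance.
    return per_thread.get()(text)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tokenize, ["a b", "c d e"]))
```

The trade-off is memory: every worker thread holds a full tokenizer copy, which is usually small compared to the model itself.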