nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.

Home Page: https://parser.kitaev.io/

RuntimeError: Already borrowed

indexxlim opened this issue

There is currently a bug when using the fast tokenizer.
If I run the parser from multiple threads, it raises the error above. Could you add a use_fast = False option so that the fast tokenizer is not used?

huggingface/tokenizers#537

use_fast = False is not really a viable option, because the "slow" tokenizers don't implement return_offsets_mapping. Parsing operates over words, while pre-trained models use subwords with a bunch of unicode substitution/normalization rules. The parser relies on having the tokenizer provide a mapping between subwords and character positions in the original string. "Slow" huggingface tokenizers don't implement this feature, and trying to reconstruct alignments after the fact is extremely error-prone due to all of the text normalization involved.
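
To illustrate why the offsets matter, here is a minimal sketch of aligning subwords to words once character offsets are available. It is not the parser's actual code: `word_spans` and `align_subwords_to_words` are hypothetical helper names, words are assumed to be whitespace-delimited, and the `offsets` list is hand-written in the `(start, end)` shape that fast tokenizers return for `return_offsets_mapping=True`, with `(0, 0)` standing in for special tokens.

```python
def word_spans(text):
    """Return the character (start, end) span of each whitespace-delimited word."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        spans.append((start, pos))
    return spans


def align_subwords_to_words(spans, offsets):
    """For each subword offset, return the index of the word containing it, or None."""
    aligned = []
    for sub_start, sub_end in offsets:
        if sub_start == sub_end:
            # Zero-width span: treated here as a special token with no source text.
            aligned.append(None)
            continue
        match = None
        for i, (w_start, w_end) in enumerate(spans):
            if w_start <= sub_start and sub_end <= w_end:
                match = i
                break
        aligned.append(match)
    return aligned


text = "unbelievable parsing"
# Hypothetical subword offsets (assumption: not produced by a real tokenizer).
offsets = [(0, 0), (0, 2), (2, 10), (10, 12), (13, 20), (0, 0)]

spans = word_spans(text)
alignment = align_subwords_to_words(spans, offsets)
```

Without the offsets, the same alignment would have to be guessed by matching normalized subword strings back against the original text, which is the error-prone step described above.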

If you're using T5-based English parsers and want a solution just for yourself, you can probably modify the tokenization code to use the original sentencepiece library instead of huggingface. But I don't plan on adding such a solution to this repository, because it's not general-purpose and only works for a limited set of pre-trained models. You could also try hacking retokenization.py to have multiple tokenizer copies in thread-local storage.
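
The thread-local idea could look roughly like the sketch below. It does not reflect what retokenization.py actually contains: `PerThreadTokenizer` and `FakeTokenizer` are hypothetical names, and the stand-in tokenizer exists only so the example runs without downloading a model. With the real library, the factory would presumably be something like `lambda: transformers.AutoTokenizer.from_pretrained(name, use_fast=True)`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadTokenizer:
    """Lazily build one tokenizer per thread from a user-supplied factory."""

    def __init__(self, factory):
        self._factory = factory          # zero-argument callable returning a new tokenizer
        self._local = threading.local()  # attributes set here are private to each thread

    def get(self):
        tok = getattr(self._local, "tokenizer", None)
        if tok is None:
            # First use in this thread: create a dedicated copy.
            tok = self._factory()
            self._local.tokenizer = tok
        return tok


class FakeTokenizer:
    """Stand-in for a fast tokenizer (assumption: for illustration only)."""

    def __call__(self, text):
        return text.split()


per_thread = PerThreadTokenizer(FakeTokenizer)


def tokenize(text):
    # Each worker thread uses its own tokenizer, so no instance is shared.
    return per_thread.get()(text)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tokenize, ["a b", "c d e"]))
```

The cost is one tokenizer copy per thread, which trades memory for never having two threads touch the same tokenizer object at once.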