kermitt2 / delft

a Deep Learning Framework for Text

Better support SentencePiece tokenizer(s)

kermitt2 opened this issue · comments

For transformer input, we started with BERT, so with WordPiece tokenization.
For sequence labeling, we typically have pre-segmented sentences (w_0, ..., w_n) with expected labels (l_0, ..., l_n), and optionally some aligned features.
We apply a WordPiece tokenizer to the pre-segmented tokens, which splits some tokens into sub-tokens; we then re-align the labels (and optionally the features) at the token level, assigning each label either to the token's first sub-token or to all of its sub-tokens.
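For illustration, a minimal sketch of this sub-token re-alignment using the Hugging Face tokenizers API (the model name, example words/labels, and the choice of propagating the label to all sub-tokens are just assumptions for the sketch, not DeLFT's actual code):

```python
from transformers import AutoTokenizer

# Illustrative pre-segmented tokens and their labels
words = ["Heparin", "-", "induced", "thrombocytopenia"]
labels = ["B-drug", "O", "O", "B-disease"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize the pre-segmented words; WordPiece may split a word into several sub-tokens
encoding = tokenizer(words, is_split_into_words=True)

# Re-align labels at the token level: here the label is propagated to all sub-tokens
# (the alternative is to label only the first sub-token and mask the others)
aligned_labels = []
for word_id in encoding.word_ids():
    if word_id is None:          # special tokens like [CLS] / [SEP]
        aligned_labels.append("X")
    else:
        aligned_labels.append(labels[word_id])

print(encoding.tokens())
print(aligned_labels)
```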

With a SentencePiece tokenizer like the Hugging Face ones, a space is added before every token (when is_split_into_words=True), even if there was originally no space, which can decrease accuracy.
To better support SentencePiece tokenizers, we could in general (also for WordPiece tokenizers) apply the tokenizer to the original, non-segmented sentence, and then re-align the labels/features at sentence level using the offsets.
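A quick way to see the space issue, using xlm-roberta-base as an arbitrary SentencePiece-based example:

```python
from transformers import AutoTokenizer

# Illustrative only: xlm-roberta-base uses a SentencePiece-based tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Pre-segmented input: each word is tokenized as if preceded by a space,
# so "▁" pieces appear even where the original text "high-tech" had no space
print(tokenizer(["high", "-", "tech"], is_split_into_words=True).tokens())

# Original, non-segmented input: the pieces follow the actual surface form
print(tokenizer("high-tech").tokens())
```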
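And a rough sketch of what the offset-based re-alignment could look like, using return_offsets_mapping=True from the Hugging Face fast tokenizers (the sentence, character spans, labels and the first-overlap policy are just assumptions for the example, not a proposed implementation):

```python
from transformers import AutoTokenizer

# Hypothetical example: original sentence, character spans of the pre-segmented
# tokens, and their labels (all illustrative)
text = "Heparin-induced thrombocytopenia"
token_spans = [(0, 7), (7, 8), (8, 15), (16, 32)]  # Heparin / - / induced / thrombocytopenia
labels = ["B-drug", "O", "O", "B-disease"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Tokenize the original, non-segmented sentence and keep character offsets
encoding = tokenizer(text, return_offsets_mapping=True)

aligned_labels = []
for start, end in encoding["offset_mapping"]:
    if start == end:  # special tokens like <s> / </s>
        aligned_labels.append("X")
        continue
    # Assign the label of the first original token whose span overlaps this piece;
    # a real implementation would need a policy for pieces crossing token boundaries
    label = "O"
    for (t_start, t_end), t_label in zip(token_spans, labels):
        if start < t_end and end > t_start:
            label = t_label
            break
    aligned_labels.append(label)

print(list(zip(encoding.tokens(), aligned_labels)))
```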