google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Allow whitespace-only pieces

bauwenst opened this issue

From what I understand, the allow_whitespace_only_pieces training argument, implemented in the word-level pretokeniser at this line, allows multiple spaces to appear next to each other in the strings that result from the pretokeniser (let's call them "pre-tokens"). Because the trainer gets its substrings from inside pre-tokens, having multiple spaces in one pre-token allows it to learn tokens consisting of more than one space.
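
For reference, this is roughly how I am training; a minimal sketch where the corpus file, model prefix, vocabulary size and model type are placeholders, and the only option of interest is allow_whitespace_only_pieces:

```python
import sentencepiece as spm

# Minimal training sketch. The input file, model prefix, vocab_size and
# model_type are placeholders; the option under discussion is
# allow_whitespace_only_pieces.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # placeholder corpus
    model_prefix="spm_ws",               # placeholder output prefix
    vocab_size=8000,                     # placeholder vocabulary size
    model_type="unigram",
    allow_whitespace_only_pieces=True,   # the flag this issue is about
)
```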

I have two questions:

  1. Is this not a confusing name for this option? When allow_whitespace_only_pieces is false, the pretokeniser produces pre-tokens that consist of whitespace only, which is completely counterintuitive. (It also means that at least one whitespace-only token will always be allowed.)
  2. For my application, what I need is what you would actually expect an option called "allow whitespace-only pieces" to do: produce pre-tokens that are either entirely whitespace or entirely non-whitespace, so that whitespace and non-whitespace are never mixed within a token (see the sketch after this list). Is this straightforward to achieve by setting training options, or does it need extra implementation?
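
To make question 2 concrete, here is a rough sketch of the splitting behaviour I am after; the helper below is purely illustrative and not an existing SentencePiece option:

```python
import re

def pretokenise(text: str) -> list[str]:
    # Illustration only: split the text into runs that are either all
    # whitespace or all non-whitespace, so the two never share a pre-token.
    return re.findall(r"\s+|\S+", text)

print(pretokenise("This is a    test sentence."))
# ['This', ' ', 'is', ' ', 'a', '    ', 'test', ' ', 'sentence.']
```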

To illustrate all of this with an example: the sentence This is a    test sentence. (with four consecutive spaces before test) is split as follows in the three cases outlined above (a small reproduction sketch follows the list):

  • allow_whitespace_only_pieces = false: This ▁is ▁a ▁ ▁ ▁ ▁test ▁sentence. (seemingly allows pieces that are whitespace-only)
  • allow_whitespace_only_pieces = true: This ▁is ▁a ▁▁▁▁test ▁sentence.
  • What I need: This ▁ is ▁ a ▁▁▁▁ test ▁ sentence.
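
If it helps, the first two splits can be inspected by encoding the sentence with two models trained identically except for the flag; the model file names below are placeholders:

```python
import sentencepiece as spm

sentence = "This is a    test sentence."  # four spaces before "test"

# Placeholder model files: identical training runs except for
# allow_whitespace_only_pieces=false vs. true.
for model_file in ("spm_ws_false.model", "spm_ws_true.model"):
    sp = spm.SentencePieceProcessor(model_file=model_file)
    print(model_file, sp.encode(sentence, out_type=str))
```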

Thanks.