Allow whitespace-only pieces
bauwenst opened this issue · comments
Thomas Bauwens commented
From what I understand, the `allow_whitespace_only_pieces` training argument, implemented in the word-level pretokeniser at this line, allows multiple spaces to appear next to each other in the strings that result from the pretokeniser (let's call them "pre-tokens"). Because the trainer gets its substrings from inside pre-tokens, having multiple spaces in one pre-token allows it to learn tokens consisting of more than one space.
I have two questions:
- Is this not a confusing way to name this option? When `allow_whitespace_only_pieces` is false, it produces pre-tokens that consist of whitespace only, which is completely counterintuitive. (It also means that there will be at least one token allowed that is whitespace-only.)
- For my application, what I need is what you would actually expect the option "allow whitespace-only pieces" to do, which is to produce pre-tokens with only whitespace and never mix whitespace with non-whitespace in tokens. Is this straightforward to do by setting training options, or does it need extra implementation?
To illustrate all of this with an example: the sentence `This is a    test sentence.` is split as follows in the three cases outlined above:
- `allow_whitespace_only_pieces = false`: `This ▁is ▁a ▁ ▁ ▁ ▁test ▁sentence.` (seemingly allows pieces that are whitespace-only)
- `allow_whitespace_only_pieces = true`: `This ▁is ▁a ▁▁▁▁test ▁sentence.`
- What I need: `This ▁ is ▁ a ▁▁▁▁ test ▁ sentence.`
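For anyone who wants to reproduce the three splits above without training a model, here is a minimal pure-Python sketch. It is not the library's implementation: the function names are hypothetical, the regular expressions are only my approximation of each behaviour, and it assumes the example sentence has four spaces between "a" and "test" (as the `▁▁▁▁` in the outputs suggests) and that no dummy prefix is added.

```python
import re

SPACE = "\u2581"  # the "▁" meta-symbol that stands in for a space


def to_meta(text: str) -> str:
    """Replace every ASCII space with the ▁ meta-symbol."""
    return text.replace(" ", SPACE)


def split_flag_false(text: str) -> list[str]:
    """Approximates allow_whitespace_only_pieces = false:
    every single ▁ starts a new pre-token."""
    return re.findall(f"{SPACE}[^{SPACE}]*|[^{SPACE}]+", to_meta(text))


def split_flag_true(text: str) -> list[str]:
    """Approximates allow_whitespace_only_pieces = true:
    a run of ▁ stays attached to the word that follows it."""
    return re.findall(f"{SPACE}+[^{SPACE}]*|[^{SPACE}]+", to_meta(text))


def split_wanted(text: str) -> list[str]:
    """The behaviour asked for in this issue (hypothetical):
    whitespace runs and non-whitespace runs are never mixed."""
    return re.findall(f"{SPACE}+|[^{SPACE}]+", to_meta(text))


sentence = "This is a    test sentence."  # four spaces before "test"
for split in (split_flag_false, split_flag_true, split_wanted):
    print(f"{split.__name__}: {' '.join(split(sentence))}")
```

The only difference between the three cases is where a boundary is placed relative to a run of `▁`, which is why each one fits in a single pattern here.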
Thanks.