Possibly confusing wording when introducing `tokenizers`?
KaiAragaki opened this issue
Please delete this if it is overly pedantic or a non-issue. I found the passage in
Lines 53 to 66 in d120e8b
a bit confusing. There it is noted that 'fir-tree' was split in half and that punctuation was lost, and packages such as `tokenizers` are then introduced as a possible advancement over the simple technique of using `str_split`. However, after using `tokenize_words`, we still see that these issues remain.
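To illustrate what I mean, here is a minimal sketch (my own example sentence, not the book's text) assuming the current defaults of `tokenize_words()` (`lowercase = TRUE`, `strip_punct = TRUE`):

```r
library(tokenizers)

tokenize_words("A fir-tree grew.")
# With default settings, "fir-tree" is still split into "fir" and
# "tree", and the trailing period is stripped -- the same two
# behaviors the book points out for the simpler approach.
```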
To me this was a little confusing, though it might not be to other people.
Thanks for writing an excellent book!
I came to the repo to see whether others were confused by this as well, or whether perhaps an earlier version of `tokenizers` used different defaults that produced results more in line with the authors' point. It would be great if there were a demonstrable difference in the results as the text stands now.
Many thanks!