EmilHvitfeldt / smltar

Manuscript of the book "Supervised Machine Learning for Text Analysis in R" by Emil Hvitfeldt and Julia Silge

Home Page: https://smltar.com

Possibly confusing wording when introducing `tokenizers`?

KaiAragaki opened this issue · comments

Please delete if this is overly pedantic or a non-issue. I found the following passage a bit confusing:

```{r}
strsplit(the_fir_tree[1:2], "[^a-zA-Z0-9]+")
```
At first sight, this result looks pretty decent. However, we have lost all punctuation, which may or may not be helpful for our modeling goal, and the hero of this story (`"fir-tree"`) was split in half. Already it is clear that tokenization is going to be quite complicated. Luckily for us, a lot of work has been invested in this process, and typically it is best to use these existing tools. For example, **tokenizers** [@Mullen18] and **spaCy** [@spacy2] implement fast, consistent tokenizers we can use. Let's demonstrate with the **tokenizers** package.
```{r}
library(tokenizers)
tokenize_words(the_fir_tree[1:2])
```
We see sensible single-word results here; the `tokenize_words()` function uses the **stringi** package [@Gagolewski19] and C++ under the hood, making it very fast. Word-level tokenization is done by finding word boundaries according to the specification from the International Components for Unicode (ICU).\index{Unicode} How does this [word boundary algorithm](https://www.unicode.org/reports/tr29/tr29-35.html#Default_Word_Boundaries) work? It can be outlined as follows:

The passage notes that `"fir-tree"` was split in half and that punctuation was lost, and it then introduces dedicated tokenization packages such as **tokenizers** as an improvement over the simple `strsplit()` approach shown above. However, after using `tokenize_words()`, both issues remain: punctuation is still dropped and `"fir-tree"` is still split.
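For concreteness, here is a minimal sketch of the comparison I mean (not from the book), assuming the `the_fir_tree` object from the book's data and using the existing `strip_punct` argument of `tokenize_words()`:

```{r}
library(tokenizers)

# Default settings: punctuation is stripped and "fir-tree" is split at the
# hyphen, much like what strsplit() produces above.
tokenize_words(the_fir_tree[1:2])

# Keeping punctuation as separate tokens; "fir-tree" is still not one token.
tokenize_words(the_fir_tree[1:2], strip_punct = FALSE)
```

Showing something like these two outputs side by side might make it clearer which of the problems the package does and does not solve by default.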

To me this was a little confusing, though it might not be to other people.

Thanks for writing an excellent book!

I came to the repo to see if others were confused by this as well, or whether perhaps an earlier version of **tokenizers** used different defaults that produced results more in line with the authors' point. It would be great if there were a demonstrable difference between the two results as the example stands now.
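If it helps, a quick (hypothetical) way to pin down which release produced a given output, so the defaults can be compared across versions:

```{r}
library(tokenizers)

# Record the installed tokenizers version alongside the output it produces,
# so results can be compared against other releases if defaults have changed.
packageVersion("tokenizers")
tokenize_words(the_fir_tree[1:2])
```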

Many thanks!