include algo of tidy autostemmer
edvardoss opened this issue
Genrikh Ananiev commented
Hi Julia! I'd be happy if my autostemming algorithm became part of the tidytext package!
https://github.com/edvardoss/abbrevTexts
Julia Silge commented
Thank you for sharing your project @edvardoss! 🙌
In #17 we discussed how to support or include stemming within tidytext and decided against it, since stemming approaches are quite diverse and already work well with tidy data principles. I see that is already true of your project:
library(tidyverse)
library(tidytext)
library(abbrevTexts)
tidy_p_and_p <-
  tibble(txt = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, txt)

p_and_p_dict <-
  makeAbbrStemDict(
    term.vec = tidy_p_and_p$word,
    min.len = 3,
    min.share = .6
  )

tidy_p_and_p %>%
  left_join(p_and_p_dict, by = c("word" = "parent")) %>%
  mutate(word = coalesce(terminal.child, word)) %>%
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE)
#> Joining, by = "word"
#> # A tibble: 4,940 × 2
#>    word          n
#>    <chr>     <int>
#>  1 mr          785
#>  2 elizabeth   635
#>  3 darcy       417
#>  4 said        401
#>  5 though      344
#>  6 mrs         343
#>  7 ever        334
#>  8 much        327
#>  9 bennet      323
#> 10 bingley     306
#> # … with 4,930 more rows

## to compare
tidy_p_and_p %>%
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE)
#> Joining, by = "word"
#> # A tibble: 6,404 × 2
#>    word          n
#>    <chr>     <int>
#>  1 mr          785
#>  2 elizabeth   597
#>  3 said        401
#>  4 darcy       373
#>  5 mrs         343
#>  6 much        326
#>  7 must        305
#>  8 bennet      294
#>  9 miss        283
#> 10 jane        264
#> # … with 6,394 more rows
Created on 2022-12-09 with reprex v2.0.2
So we are really glad to see your approach available 🎉 but it wouldn't be something we would include in tidytext itself.
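For anyone reading along who wants to see the join-and-coalesce pattern from the reprex without installing abbrevTexts, here is a minimal sketch using only dplyr and tibble. The tokens in `toy_words` and the dictionary `toy_dict` are made up for illustration and were not produced by `makeAbbrStemDict()`. Only the column names `parent` and `terminal.child` mirror how the reprex uses them.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens, one word per row (the shape unnest_tokens() returns)
toy_words <- tibble(word = c("walk", "walked", "walking", "talk", "the"))

# Hypothetical stemming dictionary: column names follow the reprex,
# but the contents are invented for this example
toy_dict <- tibble(
  parent = c("walked", "walking"),
  terminal.child = c("walk", "walk")
)

toy_stemmed <- toy_words %>%
  # words with a dictionary entry pick up a stem; all others get NA
  left_join(toy_dict, by = c("word" = "parent")) %>%
  # use the stem where there is one, otherwise keep the original word
  mutate(word = coalesce(terminal.child, word)) %>%
  count(word, sort = TRUE)

toy_stemmed
```

Because any stemmer that returns a two-column lookup table can be plugged in this way, the same `left_join()` plus `coalesce()` step should work with other stemming approaches as well.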