juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

Include tidy autostemmer algorithm

edvardoss opened this issue

Hi Julia! I'd be happy if my autostemming algorithm became part of the tidytext package!
https://github.com/edvardoss/abbrevTexts

Thank you for sharing your project @edvardoss! 🙌

In #17 we discussed how to support or include stemming within tidytext and decided against it, since stemming approaches are quite diverse and already work well with tidy data principles. I see that this is already true of your project:

library(tidyverse)
library(tidytext)
library(abbrevTexts)

tidy_p_and_p <- 
    tibble(txt = janeaustenr::prideprejudice) %>%
    unnest_tokens(word, txt)

p_and_p_dict <- 
    makeAbbrStemDict(
        term.vec = tidy_p_and_p$word,
        min.len = 3,
        min.share = .6
    )

tidy_p_and_p %>%
    left_join(p_and_p_dict, by = c("word" = "parent")) %>%
    mutate(word = coalesce(terminal.child, word)) %>%
    anti_join(get_stopwords()) %>%
    count(word, sort = TRUE)
#> Joining, by = "word"
#> # A tibble: 4,940 × 2
#>    word          n
#>    <chr>     <int>
#>  1 mr          785
#>  2 elizabeth   635
#>  3 darcy       417
#>  4 said        401
#>  5 though      344
#>  6 mrs         343
#>  7 ever        334
#>  8 much        327
#>  9 bennet      323
#> 10 bingley     306
#> # … with 4,930 more rows


## to compare
tidy_p_and_p %>%
    anti_join(get_stopwords()) %>%
    count(word, sort = TRUE)
#> Joining, by = "word"
#> # A tibble: 6,404 × 2
#>    word          n
#>    <chr>     <int>
#>  1 mr          785
#>  2 elizabeth   597
#>  3 said        401
#>  4 darcy       373
#>  5 mrs         343
#>  6 much        326
#>  7 must        305
#>  8 bennet      294
#>  9 miss        283
#> 10 jane        264
#> # … with 6,394 more rows

Created on 2022-12-09 with reprex v2.0.2

So we are really glad to see your approach available 🎉 but it wouldn't be something we would include in tidytext itself.
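
For comparison, here is a minimal sketch of how a different stemmer slots into the same kind of pipeline with a plain `mutate()`, which is the sense in which stemming already works with tidy data principles. It assumes the SnowballC package is installed; the two example sentences and the object names are made up for illustration and are not from the discussion above.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical toy input, invented for this sketch
example_text <- tibble(
    txt = c("She walked while he was walking",
            "Walks and talks are connected")
)

stemmed_counts <-
    example_text %>%
    # one row per lowercased token
    unnest_tokens(word, txt) %>%
    # stemming is an ordinary column transformation
    mutate(stem = wordStem(word, language = "english")) %>%
    count(stem, sort = TRUE)

stemmed_counts
```

Here "walked", "walking", and "walks" should all collapse to the stem "walk". A dictionary-based approach such as abbrevTexts uses a join instead of a `mutate()`, as in the reprex above, but either way the stemming step stays outside tidytext.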