juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

Any chance we can get parallel processing, for n-grams for example?

jaymon0703 opened this issue

Thanks for a great package, by the way.

I'm not sure that identifying n-grams is parallelizable in a straightforward way, since we need to slide along the text to find the overlapping tokens. You could do something like this using furrr, if you wanted to find n-grams for separate documents using parallel processing.

library(tidyverse)
library(tidytext)
library(furrr)
#> Loading required package: future

## nest by document
nested_austen <- janeaustenr::austen_books() %>% 
  mutate(title = book) %>%
  nest(data = c(title, text))

nested_austen
#> # A tibble: 6 × 2
#>   book                data                 
#>   <fct>               <list>               
#> 1 Sense & Sensibility <tibble [12,624 × 2]>
#> 2 Pride & Prejudice   <tibble [13,030 × 2]>
#> 3 Mansfield Park      <tibble [15,349 × 2]>
#> 4 Emma                <tibble [16,235 × 2]>
#> 5 Northanger Abbey    <tibble [7,856 × 2]> 
#> 6 Persuasion          <tibble [8,328 × 2]>


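## set up a parallel backend with two background R sessions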
plan(multisession, workers = 2)

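## tokenize each book's nested text into bigrams, farming the books out to the workers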
tokenized <- 
  nested_austen %>%
  mutate(tokens = future_map(
    data, 
    ~ unnest_tokens(., bigram, text, collapse = "title", token = "ngrams", n = 2)
  ))

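## unnest the list-column of tokens back into one tidy data frame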
tokenized %>%
  select(tokens) %>%
  unnest(tokens)
#> # A tibble: 725,049 × 2
#>    title               bigram         
#>    <fct>               <chr>          
#>  1 Sense & Sensibility sense and      
#>  2 Sense & Sensibility and sensibility
#>  3 Sense & Sensibility sensibility by 
#>  4 Sense & Sensibility by jane        
#>  5 Sense & Sensibility jane austen    
#>  6 Sense & Sensibility austen 1811    
#>  7 Sense & Sensibility 1811 chapter   
#>  8 Sense & Sensibility chapter 1      
#>  9 Sense & Sensibility 1 the          
#> 10 Sense & Sensibility the family     
#> # … with 725,039 more rows

Created on 2021-11-29 by the reprex package (v2.0.1)

In the general case, there is a fair amount of complexity in specifying what chunks of text should go to parallel workers.
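
For example (a minimal, made-up sketch rather than anything from the reprex above), naively splitting a document's text into chunks and tokenizing each chunk on its own silently drops the n-grams that span a chunk boundary:

library(tidyverse)
library(tidytext)

## the same toy sentence, intact vs. split into two arbitrary chunks
intact  <- tibble(txt = "one two three four")
chunked <- tibble(chunk = 1:2, txt = c("one two", "three four"))

## the intact text yields the bigrams "one two", "two three", "three four"
unnest_tokens(intact, bigram, txt, token = "ngrams", n = 2)

## each chunk row is tokenized separately, so the boundary bigram
## "two three" never appears
unnest_tokens(chunked, bigram, txt, token = "ngrams", n = 2)

Any general-purpose chunking scheme would presumably have to decide where boundaries can safely go and stitch the boundary tokens back together, which is where that complexity lives.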

Thanks, Julia. You're right, this may be more effort than it's worth. I would appreciate others' thoughts before deciding whether or not to close the issue.

Let me know if you have further questions!

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.