juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/


unnest_tokens on large corpus with limited RAM

steelcitysi opened this issue

Hello,

Thank you for a wonderful tool.

I have noticed that RAM consumption becomes the computational bottleneck when unnesting tokens from a large corpus, and it gets rapidly worse as the degree of the n-grams increases. When RAM is limited, is there a suggested method for running unnest_tokens on smaller partitions of a larger corpus and then combining the results? The idea is to lower the peak RAM demand but still arrive at the same final list of tokens as if one had unlimited RAM.

Thank you for the kind words @steelcitysi!

What I would recommend you do is break up your original corpus in a way that is sensible for your analysis, and then iterate through those chunks. Here is an example where I break the text up by book:

library(tidyverse)
library(tidytext)

nested_books <- 
  janeaustenr::austen_books() %>%
  nest(data = -book)

nested_books
#> # A tibble: 6 × 2
#>   book                data                 
#>   <fct>               <list>               
#> 1 Sense & Sensibility <tibble [12,624 × 1]>
#> 2 Pride & Prejudice   <tibble [13,030 × 1]>
#> 3 Mansfield Park      <tibble [15,349 × 1]>
#> 4 Emma                <tibble [16,235 × 1]>
#> 5 Northanger Abbey    <tibble [7,856 × 1]> 
#> 6 Persuasion          <tibble [8,328 × 1]>

nested_tokens <-
  nested_books %>%
  mutate(tokens = map(
    data, 
    ~ unnest_tokens(., word, text, token = "ngrams", n = 2))
  )

nested_tokens
#> # A tibble: 6 × 3
#>   book                data                  tokens                
#>   <fct>               <list>                <list>                
#> 1 Sense & Sensibility <tibble [12,624 × 1]> <tibble [111,561 × 1]>
#> 2 Pride & Prejudice   <tibble [13,030 × 1]> <tibble [114,045 × 1]>
#> 3 Mansfield Park      <tibble [15,349 × 1]> <tibble [149,201 × 1]>
#> 4 Emma                <tibble [16,235 × 1]> <tibble [150,130 × 1]>
#> 5 Northanger Abbey    <tibble [7,856 × 1]>  <tibble [72,417 × 1]> 
#> 6 Persuasion          <tibble [8,328 × 1]>  <tibble [77,671 × 1]>

nested_tokens %>%
  select(book, tokens) %>%
  unnest(tokens)
#> # A tibble: 675,025 × 2
#>    book                word           
#>    <fct>               <chr>          
#>  1 Sense & Sensibility sense and      
#>  2 Sense & Sensibility and sensibility
#>  3 Sense & Sensibility <NA>           
#>  4 Sense & Sensibility by jane        
#>  5 Sense & Sensibility jane austen    
#>  6 Sense & Sensibility <NA>           
#>  7 Sense & Sensibility <NA>           
#>  8 Sense & Sensibility <NA>           
#>  9 Sense & Sensibility <NA>           
#> 10 Sense & Sensibility <NA>           
#> # … with 675,015 more rows

Created on 2022-07-06 by the reprex package (v2.0.1)
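
Not part of the original answer, but one further adaptation of this pattern: if your end goal is n-gram counts rather than the raw token list, counting within each chunk before combining means the full corpus-wide token list never has to exist in memory at once. A minimal sketch, building on the nested_books object and libraries loaded above:

chunk_counts <-
  nested_books %>%
  mutate(counts = map(
    data,
    ~ unnest_tokens(.x, word, text, token = "ngrams", n = 2) %>%
      count(word)
  )) %>%
  select(book, counts) %>%
  unnest(counts)

# Corpus-level counts, summing the per-book counts
chunk_counts %>%
  count(word, wt = n, sort = TRUE)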

In the example above, the function I map over is just a plain unnest_tokens(., word, text, token = "ngrams", n = 2), but you could make it something that writes out to a database or file instead, perhaps using group_walk():

library(tidyverse)
library(tidytext)

# Tokenize one chunk into bigrams and append the result to a CSV file
# (append = TRUE skips the header row, so every chunk is written the same way)
unnest_tokens_and_write_csv <- function(df, file) {
  unnest_tokens(df, word, text, token = "ngrams", n = 2) %>%
    write_csv(file, append = TRUE)
}

tmp_file <- tempfile(pattern = "tokens", fileext = ".csv")

# Process one book at a time so only one book's tokens are in memory at once
janeaustenr::austen_books() %>%
  group_by(book) %>%
  group_walk(~ unnest_tokens_and_write_csv(.x, tmp_file))

Created on 2022-07-06 by the reprex package (v2.0.1)
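
For the database option mentioned above, the same group_walk() pattern applies. Here is a minimal sketch (not from the original answer) using DBI and RSQLite, assuming tidyverse and tidytext are loaded as above; the database path and the "tokens" table name are placeholders:

library(DBI)

# Placeholder SQLite database; swap in whatever backend you actually use
con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Tokenize one chunk into bigrams and append the rows to the "tokens" table
unnest_tokens_and_write_db <- function(df, con) {
  bigrams <- unnest_tokens(df, word, text, token = "ngrams", n = 2)
  dbWriteTable(con, "tokens", bigrams, append = TRUE)
}

janeaustenr::austen_books() %>%
  group_by(book) %>%
  group_walk(~ unnest_tokens_and_write_db(.x, con))

dbDisconnect(con)

Note that group_walk() passes the group data without the grouping column, so if you need the book name in the table, it is available to the helper as .y.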

I don't think we will provide an automated way to do this (i.e. handle out-of-memory text analysis for you), but these are some approaches you can adapt to your use case.
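
And if even the raw corpus is too large to read into memory at once, the same chunked pattern can run over files on disk. A minimal sketch, assuming (hypothetically) one plain-text file per document in a corpus/ directory:

library(tidyverse)
library(tidytext)

# Hypothetical layout: one .txt file per document
corpus_files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)
out_file <- tempfile(pattern = "tokens", fileext = ".csv")

# Read, tokenize, and write out one document at a time
walk(corpus_files, function(path) {
  tibble(text = read_lines(path)) %>%
    unnest_tokens(word, text, token = "ngrams", n = 2) %>%
    write_csv(out_file, append = TRUE)
})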

Let me know if you have further questions!

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.