EmilHvitfeldt / smltar

Manuscript of the book "Supervised Machine Learning for Text Analysis in R" by Emil Hvitfeldt and Julia Silge

Home Page: https://smltar.com


Chapter 6 code freezing RStudio

PursuitOfDataScience opened this issue

commented

Hi,

Thanks for such an amazingly written book on text analysis using tidymodels.

I've been trying to replicate the code presented in Chapter 6, which uses the scotus data to train a few machine learning models. When I use prep() on the scotus_rec recipe, my R session gets stuck and I have to force-quit it. I got the same result even after changing max_tokens to 300. Is it expected for these chunks to be this expensive to run? The data set doesn't seem big enough to cause such an issue. Training the model is out of the question, as it is extremely slow and the session's memory usage can reach 8 GB.
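For reference, this is roughly what I ran: the recipe from the chapter with max_tokens lowered to 300. The split and the variable names follow the book, so treat this as a sketch of my session rather than an exact transcript:

library(tidymodels)
library(textrecipes)

# Same recipe as the chapter, but with a smaller vocabulary
scotus_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 300) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())

# This is the step where my session hangs
scotus_prep <- prep(scotus_rec)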

Thanks!

Hello @PursuitOfDataScience, that sounds a little extreme. The data isn't super small, but it shouldn't give you problems with that amount of memory. Can you run session_info() for me and paste it here? Then we can try to figure out why this is happening to you.
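In case it helps, something like this should produce the output I'm after (I'm assuming the sessioninfo package here; base R's sessionInfo() works too):

# Detailed session info, including package versions
sessioninfo::session_info()

# Or, with base R only:
sessionInfo()

For comparison, here is the full chapter code running fine on my machine as a reprex: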

library(tidymodels)
library(tidyverse)
library(textrecipes)
library(scotus)

set.seed(1234)
# Convert year to numeric and strip apostrophes before splitting
scotus_split <- scotus_filtered %>%
  mutate(year = as.numeric(year),
         text = str_remove_all(text, "'")) %>%
  initial_split()

scotus_train <- training(scotus_split)
scotus_test <- testing(scotus_split)

# Tokenize, keep the 1,000 most frequent tokens, compute tf-idf,
# and center and scale all predictors
scotus_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())

scotus_prep <- prep(scotus_rec)

# Memory footprint of the prepped recipe
lobstr::obj_size(scotus_prep)
#> 254,989,104 B

Created on 2022-04-28 by the reprex package (v2.0.1)
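If prep() still hangs for you after we sort out versions, one quick check (plain base R, not from the book) is to time that step in isolation:

# Time just the expensive step to see where the slowdown is
system.time(
  scotus_prep <- prep(scotus_rec)
)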

commented

Hi, thanks for the answer!

I reran the code and it was way faster than when I ran it yesterday. I guess something was wrong with my R session then. Now it seems like all is good. Thanks!