OscarKjell / text

Using Transformers from HuggingFace in R

Home Page: https://r-text.org

textEmbed Failing

scm1210 opened this issue · comments

Hi Oscar--

I've been trying to embed texts from a CSV. I haven't had an issue in the past, but some updates to R seem to have changed the behavior of the function. When I run

job::job({
  all_embeddings <- textEmbed(data_long[2:19], #embed the transcription questions 5 through 23
                              model = "bert-large-uncased",#use bert-large-uncased
                              layers = -2, #second to last layer, this is empirically driven... 
                              aggregation_from_layers_to_tokens = "concatenate",
                              aggregation_from_tokens_to_texts = "mean",
                              aggregation_from_tokens_to_word_types = "mean",
                              keep_token_embeddings = FALSE)
  
  saveRDS(all_embeddings, "/Users/stevenmesquiti/Desktop/LP2-wellbeing-pred/study2_all_embeddings.rds")
  rm(all_embeddings) #remove the object from our working environment 
})

the code fails and presents this error:

Warning messages:
1: In textEmbed(data_long[2:19], model = "bert-large-uncased", layers = -2,  :
  texts contain NA-values.
2: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
ℹ The deprecated feature was likely used in the text package.

I haven't run into this issue in the past. I'll try to see if the devtools version replicates this error and get back to you. In the meantime, here are the specs for my machine and the R setup I'm using.

macOS Monterey 12.7.4
RStudio: V 4.3.3

Thanks so much for all the work you and your team do!

Just confirming that I replicated the same error with the devtools version.

Hi,
Thanks for reporting the issue. Can you please confirm whether running this code works?

(You are reporting two warnings, but I cannot see an error.
There might be a problem with NAs in your text data, which the text package should be able to handle...)

all_embeddings <- textEmbed(Language_based_assessment_data_8[1:2,1:2] , #embed the first two rows of the first two columns of the example data
                            model = "bert-large-uncased",#use bert-large-uncased
                            layers = -2, #second to last layer, this is empirically driven... 
                            aggregation_from_layers_to_tokens = "concatenate",
                            aggregation_from_tokens_to_texts = "mean",
                            aggregation_from_tokens_to_word_types = "mean",
                            keep_token_embeddings = F)
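
Since the first warning says the texts contain NA values, a quick look at where those NAs sit may help before re-running textEmbed. The following is only a minimal sketch in base R: the toy data frame and its column names (id, q1, q2) are invented stand-ins for data_long, and nothing here calls the text package.

```r
# Hypothetical stand-in for the real data; the column names are invented.
data_long <- data.frame(
  id = 1:4,
  q1 = c("I feel fine", NA, "Mostly good", "Tired"),
  q2 = c("Happy", "Calm", "", NA),
  stringsAsFactors = FALSE
)

# Select only the text columns (here columns 2 and 3 of the toy data).
text_cols <- data_long[2:3]

# Count NA values per text column.
na_counts <- colSums(is.na(text_cols))

# Count empty or whitespace-only strings, which are not NA but carry no text.
empty_counts <- vapply(
  text_cols,
  function(x) sum(!is.na(x) & trimws(x) == ""),
  integer(1)
)

# Indices of rows where every text column is NA.
all_na_rows <- which(rowSums(!is.na(text_cols)) == 0)

print(na_counts)
print(empty_counts)
print(all_na_rows)
```

With the real data, text_cols would be data_long[2:19]. Columns with unexpectedly high counts would be the first candidates to inspect.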

Hi Oscar--

I was able to run that code successfully without error. I think it's an issue with the structure of this particular dataset (and in particular a unique variable). I failed to replicate the error with another dataset I have.

Thanks,
Steven
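
If a single variable is suspected, one way to narrow it down might be to run the function on each column separately and record what it signals. The sketch below is an assumption-laden illustration, not part of the text package: check_columns and toy_embed are hypothetical helpers written for this example, and toy_embed merely imitates a function that warns on NA input.

```r
# Apply `f` to each column on its own and record the first warning or error.
# `f` is a placeholder; for this issue it might wrap textEmbed() with the
# same arguments as in the original call.
check_columns <- function(df, f) {
  vapply(names(df), function(nm) {
    tryCatch(
      {
        f(df[[nm]])
        "ok"
      },
      # tryCatch stops at the first condition signalled, so only that one
      # message is kept per column.
      warning = function(w) paste("warning:", conditionMessage(w)),
      error = function(e) paste("error:", conditionMessage(e))
    )
  }, character(1))
}

# Toy function that warns on NA input; used only to demonstrate the helper.
toy_embed <- function(x) {
  if (anyNA(x)) warning("texts contain NA-values.")
  nchar(x)
}

toy_data <- data.frame(
  a = c("one", "two"),
  b = c("three", NA),
  stringsAsFactors = FALSE
)

result <- check_columns(toy_data, toy_embed)
print(result)
```

Because each column is processed in isolation, a column that behaves differently from the rest stands out in the returned character vector.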