OscarKjell / text

Using Transformers from HuggingFace in R

Home Page: https://r-text.org


Issue with texts with > 512 tokens

adamramey opened this issue

Hello, I'm running the latest GitHub version of r-text. Everything is working fine, except for one issue. On page 6 of your Psychological Methods piece, you say "If longer sequences of text are submitted to the text-package function, the text will be split up in smaller chunks; and then the word embeddings may be aggregated to represent the entire text."

However, when I feed in text with more than 512 tokens, no matter which model I choose, the token-level output is always 512 x d (where d = 768 or 1024, depending on the model).

I've tried to dig through the R functions to see what's going on and I can't seem to figure it out.
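For reference, the chunk-and-aggregate idea in the quoted passage can be sketched in base R. This is only an illustration of the general technique, not the text package's actual implementation: `chunk_tokens()` and `toy_embed()` are hypothetical helpers, the 512-token limit is taken from the discussion above, and the "embedding" is a stand-in that returns dummy numbers rather than real model output.

```r
# Split a vector of tokens into consecutive chunks of at most max_len tokens
# (hypothetical helper, not part of the text package)
chunk_tokens <- function(tokens, max_len = 512) {
  split(tokens, ceiling(seq_along(tokens) / max_len))
}

# Stand-in for a language model: returns one row of d numbers per token.
# These are dummy values, NOT real embeddings.
toy_embed <- function(tokens, d = 4) {
  matrix(nchar(tokens), nrow = length(tokens), ncol = d)
}

# A made-up "text" of 1200 tokens, longer than the 512-token limit
tokens <- rep(c("humour", "is", "funny"), length.out = 1200)
chunks <- chunk_tokens(tokens)

# Embed each chunk separately, then stack the token-level rows
token_embeddings <- do.call(rbind, lapply(chunks, toy_embed))

# Aggregate the token-level rows into a single vector for the whole text
text_embedding <- colMeans(token_embeddings)
```

If chunking works as described, the stacked token-level matrix should have one row per token of the full text (1200 here), not just the first 512.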

Hi,
Thanks for the feedback. Can you give a reproducible example?
I had a quick look using textEmbed() as below, and the output includes:

$tokens$texts[[1]]

A tibble: 798 × 769

long_text_test <- c("Humour (British English) or humor (American English; see spelling differences) is the tendency of experiences to provoke laughter and provide amusement. The term derives from the humoral medicine of the ancient Greeks, which taught that the balance of fluids in the human body, known as humours (Latin: humor, body fluid), controlled human health and emotion.
 People of all ages and cultures respond to humour. Most people are able to experience humour—be amused, smile or laugh at something funny (such as a pun or joke)—and thus are considered to have a sense of humour. The hypothetical person lacking a sense of humour would likely find the behaviour inducing it to be inexplicable, strange, or even irrational. Though ultimately decided by personal taste, the extent to which a person finds something humorous depends on a host of variables, including geographical location, culture, maturity, level of education, intelligence and context. For example, young children may favour slapstick such as Punch and Judy puppet shows or the Tom and Jerry cartoons, whose physical nature makes it accessible to them. By contrast, more sophisticated forms of humour such as satire require an understanding of its social meaning and context, and thus tend to appeal to a more mature audience.
 Humour (British English) or humor (American English; see spelling differences) is the tendency of experiences to provoke laughter and provide amusement. The term derives from the humoral medicine of the ancient Greeks, which taught that the balance of fluids in the human body, known as humours (Latin: humor, body fluid), controlled human health and emotion.
 People of all ages and cultures respond to humour. Most people are able to experience humour—be amused, smile or laugh at something funny (such as a pun or joke)—and thus are considered to have a sense of humour. The hypothetical person lacking a sense of humour would likely find the behaviour inducing it to be inexplicable, strange, or even irrational. Though ultimately decided by personal taste, the extent to which a person finds something humorous depends on a host of variables, including geographical location, culture, maturity, level of education, intelligence and context. For example, young children may favour slapstick such as Punch and Judy puppet shows or the Tom and Jerry cartoons, whose physical nature makes it accessible to them. By contrast, more sophisticated forms of humour such as satire require an understanding of its social meaning and context, and thus tend to appeal to a more mature audience.
 Humour (British English) or humor (American English; see spelling differences) is the tendency of experiences to provoke laughter and provide amusement. The term derives from the humoral medicine of the ancient Greeks, which taught that the balance of fluids in the human body, known as humours (Latin: humor, body fluid), controlled human health and emotion.
 People of all ages and cultures respond to humour. Most people are able to experience humour—be amused, smile or laugh at something funny (such as a pun or joke)—and thus are considered to have a sense of humour. The hypothetical person lacking a sense of humour would likely find the behaviour inducing it to be inexplicable, strange, or even irrational. Though ultimately decided by personal taste, the extent to which a person finds something humorous depends on a host of variables, including geographical location, culture, maturity, level of education, intelligence and context. For example, young children may favour slapstick such as Punch and Judy puppet shows or the Tom and Jerry cartoons, whose physical nature makes it accessible to them. By contrast, more sophisticated forms of humour such as satire require an understanding of its social meaning and context, and thus tend to appeal to a more mature audience.
 ")

long_text_embedding <- textEmbed(
  long_text_test,
  model = "bert-base-uncased"
)

Thanks so much - I get the same results as you! It seems the issue has to do with my actual text example. There were some issues with a gsub call I used to remove strange characters. It has nothing to do with the package after all! Now I have a bigger problem - the full-size text is so large that it's crashing my system!
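In case it helps anyone who lands here with a similar problem, below is a minimal sketch of a cleaning step with a sanity check afterwards. The original gsub call is not shown in this thread, so the pattern here (replace anything outside printable ASCII with a space) and the example string are assumptions for illustration only.

```r
# Hypothetical raw text containing a non-breaking space and a control character
raw_text <- "Humour\u00a0is the tendency of experiences\u0007 to provoke laughter."

# Replace every character outside the printable ASCII range with a space
# (assumed pattern; the gsub call from the thread is not shown)
clean_text <- gsub("[^ -~]", " ", raw_text)

# Collapse runs of whitespace and trim the ends
clean_text <- trimws(gsub("\\s+", " ", clean_text))

# Sanity checks: no unwanted characters remain, and the text was not
# drastically shortened by an over-greedy pattern
has_strange_chars <- grepl("[^ -~]", clean_text)
length_ratio <- nchar(clean_text) / nchar(raw_text)
```

A `length_ratio` far below 1 would suggest the pattern removed much more than intended, which is worth checking before passing the text to textEmbed().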

Thanks for confirming that.
Good luck!