juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

unnest_tokens() with token = "tweets" bug

PursuitOfDataScience opened this issue · comments

Hi,

I am using unnest_tokens(token = "tweets") and it throws an error message, the same as in #119: Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, : 'breaks' are not unique. When I dug into where the problem is, I found one tweet, "LMFAOOOOO", that breaks the code and produces the aforementioned message. It doesn't make a whole lot of sense to me, as the tweet is treated as "one word." Filtering out this one tweet, or removing token = "tweets", works around the issue, but I am not sure if this is a bug. Very strange though.

Thanks!
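In case it helps, here is a minimal way to reproduce what I'm seeing (this assumes the CRAN tokenizers release, where the problem still occurs):

```r
library(tidytext)
library(dplyr)

# A tweet that tokenizes to a single all-caps "word", as described above
single_word <- tibble(sentences = "LMFAOOOOO!")

# With the CRAN tokenizers this fails with:
# Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, :
#   'breaks' are not unique
single_word %>%
  unnest_tokens(word, sentences, token = "tweets")
```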

The underlying problem has now been fixed in the tokenizers package, but only in the development version (there hasn't been a tokenizers CRAN release since 2018). You can install the fixed version via devtools::install_github("ropensci/tokenizers").

library(tidytext)
library(tidyverse)

single_word <- tibble(sentences = "LMFAOOOOO!")

single_word %>%
    unnest_tokens(word, sentences, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 1 × 1
#>   word     
#>   <chr>    
#> 1 lmfaooooo

Created on 2022-03-09 by the reprex package (v2.0.1)
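If it's unclear whether the fix is in place, checking the installed tokenizers version can help (the last CRAN release is from 2018, so only a development version installed from GitHub contains the fix):

```r
# The 2018 CRAN tokenizers still has the bug; a newer development
# version installed from GitHub is needed for the fix
packageVersion("tokenizers")
```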

I know that for some situations, it is quite inconvenient or impossible to install from GitHub. If you are in that situation, you might ask Lincoln on the tokenizers repo if he is able to do a CRAN release soon.

Hi Julia,
This code was working until a few days ago, but now it throws a persistent set of error messages. What I am trying to do is compute the Pearson correlation among the Brontë sisters themselves, rather than between Jane Austen and the Brontë sisters as a collective. It seems that the Project Gutenberg downloads are failing, and that failure cascades through the subsequent commands.

library(dplyr)
library(stringr)
library(gutenbergr)
library(ggplot2)
library(tidytext)
library(tidyr)
library(scales)

cbronte <- gutenberg_download(c(1260, 9182))
tidy_cbronte <- cbronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_cbronte %>%
  count(word, sort = TRUE)

ebronte <- gutenberg_download(c(768))
tidy_ebronte <- ebronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_ebronte %>%
  count(word, sort = TRUE)

abronte <- gutenberg_download(c(969, 767))
tidy_abronte <- abronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_abronte %>%
  count(word, sort = TRUE)

frequency <- bind_rows(mutate(tidy_cbronte, author = "Charlotte Brontë"),
                       mutate(tidy_ebronte, author = "Emily Brontë"),
                       mutate(tidy_abronte, author = "Anne Brontë")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Charlotte Brontë`:`Emily Brontë`,
               names_to = "author", values_to = "proportion")

frequency

cor.test(data = frequency[frequency$author == "Charlotte Brontë",],
         ~ proportion + `Anne Brontë`)

cor.test(data = frequency[frequency$author == "Emily Brontë",],
         ~ proportion + `Anne Brontë`)
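Since the downloads themselves seem to be failing, it may help to first check whether gutenberg_download() returns any rows at all, and to try an explicit mirror via its mirror argument (the mirror URL below is just an example; any working Project Gutenberg mirror should do):

```r
library(gutenbergr)

cbronte <- gutenberg_download(c(1260, 9182))
nrow(cbronte)  # zero rows means the download itself failed

# Retry with an explicit mirror (example URL; substitute any working mirror)
cbronte <- gutenberg_download(c(1260, 9182),
                              mirror = "http://mirrors.xmission.com/gutenberg/")
```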

Hi @TerencePatrick I don't recommend that you post on closed issues with unrelated questions. It looks like you aren't using token = "tweets" for this, correct?

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.