juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

unnest_tokens() with token = "tweets" bug

PursuitOfDataScience opened this issue · comments

Hi,

I am using unnest_tokens(token = "tweets") and it throws an error message, the same as in #119: Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, : 'breaks' are not unique. When I dug into where the problem is, I found one tweet, "LMFAOOOOO", that breaks the code and produces the aforementioned message. It doesn't make a whole lot of sense to me, as the tweet is treated as "one word." Filtering out this one tweet, or removing token = "tweets", works around the issue, but I am not sure if this is a bug. Very strange though.

Thanks!
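In case it helps, here is a minimal way to reproduce what I'm seeing (this assumes the CRAN tokenizers release, where the problem still occurs):

```r
library(tidytext)
library(dplyr)

# A tweet that tokenizes to a single all-caps "word", as described above
single_word <- tibble(sentences = "LMFAOOOOO!")

# With the CRAN tokenizers this fails with:
# Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, :
#   'breaks' are not unique
single_word %>%
  unnest_tokens(word, sentences, token = "tweets")
```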

The underlying problem has now been fixed in the tokenizers package, but only in the development version (there hasn't been a tokenizers CRAN release since 2018). You can install the fixed version via devtools::install_github("ropensci/tokenizers").

library(tidytext)
library(tidyverse)

single_word <- tibble(sentences = "LMFAOOOOO!")

single_word %>%
    unnest_tokens(word, sentences, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 1 × 1
#>   word     
#>   <chr>    
#> 1 lmfaooooo

Created on 2022-03-09 by the reprex package (v2.0.1)
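If it's unclear whether the fix is in place, checking the installed tokenizers version can help (the last CRAN release is from 2018, so only a development version installed from GitHub contains the fix):

```r
# The 2018 CRAN tokenizers still has the bug; a newer development
# version installed from GitHub is needed for the fix
packageVersion("tokenizers")
```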

I know that for some situations, it is quite inconvenient or impossible to install from GitHub. If you are in that situation, you might ask Lincoln on the tokenizers repo if he is able to do a CRAN release soon.

Hi Julia,
This code was working until a few days ago, but now it throws a persistent set of error messages. What I am trying to do is compute the Pearson correlation among the Brontë sisters themselves, rather than between Jane Austen and the Brontë sisters as a collective. It seems that the Project Gutenberg downloads are failing, and that failure cascades through the subsequent commands.

library(dplyr)
library(stringr)
library(gutenbergr)
library(ggplot2)
library(tidytext)
library(tidyr)
library(scales)

cbronte <- gutenberg_download(c(1260, 9182))
tidy_cbronte <- cbronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_cbronte %>%
  count(word, sort = TRUE)

ebronte <- gutenberg_download(c(768))
tidy_ebronte <- ebronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_ebronte %>%
  count(word, sort = TRUE)

abronte <- gutenberg_download(c(969, 767))
tidy_abronte <- abronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_abronte %>%
  count(word, sort = TRUE)

frequency <- bind_rows(mutate(tidy_cbronte, author = "Charlotte Brontë"),
                       mutate(tidy_ebronte, author = "Emily Brontë"),
                       mutate(tidy_abronte, author = "Anne Brontë")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Charlotte Brontë`:`Emily Brontë`,
               names_to = "author", values_to = "proportion")

frequency

cor.test(data = frequency[frequency$author == "Charlotte Brontë",],
         ~ proportion + `Anne Brontë`)

cor.test(data = frequency[frequency$author == "Emily Brontë",],
         ~ proportion + `Anne Brontë`)
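Since the downloads themselves seem to be failing, it may help to first check whether gutenberg_download() returns any rows at all, and to try an explicit mirror via its mirror argument (the mirror URL below is just an example; any working Project Gutenberg mirror should do):

```r
library(gutenbergr)

cbronte <- gutenberg_download(c(1260, 9182))
nrow(cbronte)  # zero rows means the download itself failed

# Retry with an explicit mirror (example URL; substitute any working mirror)
cbronte <- gutenberg_download(c(1260, 9182),
                              mirror = "http://mirrors.xmission.com/gutenberg/")
```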

Hi @TerencePatrick I don't recommend that you post on closed issues with unrelated questions. It looks like you aren't using token = "tweets" for this, correct?

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.