juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

tidytext idf negative values due to wrong counting of number of documents

StefanoRapisarda opened this issue

Hi guys,

I am running some text mining, and for one specific word ("the") I get negative idf values. The word "the" is present in all the documents of my sample, so according to the idf definition, idf("the") should be zero. Here is my workflow:

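library(readr)     # read_delim(), cols(), col_date()
library(dplyr)     # %>%, count(), arrange(), distinct()
library(tidytext)  # unnest_tokens(), bind_tf_idf()

data_file_name <- 'times_ocr=80_100&date=1945-01-01_2010-12-31&query=_european_union_&category=News.csv'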
data_df <- read_delim(data_file_name, delim = ";", escape_double = FALSE, col_types = cols(`date-pub` = col_date(format = "%B %d, %Y")), trim_ws = TRUE)

issue_words <- data_df %>%
  unnest_tokens(word, content) %>%
  count(issue, word, sort = TRUE)

issue_tf_idf <- issue_words %>%
  bind_tf_idf(word, issue, n)

issue_tf_idf %>%
  arrange(tf_idf)

I narrowed the problem down to a miscount of the total number of documents (issues) in the sample.
The number of documents is 1305, which you can check with:

print(nrow(data_df %>% distinct(issue)))

However, inside bind_tf_idf() the total number of documents is computed from the result of tapply() (that is, from grouping), and the count is 1304. You can check this with the following line, extracted from bind_tf_idf():

print(length(tapply(issue_words$n, issue_words$issue, sum)))

Because of this miscount, the idf of "the" comes out negative.
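
To make the arithmetic concrete, here is a minimal sketch. It assumes that idf is the natural log of the number of documents divided by the number of documents containing the term, and that "the" is counted in 1305 documents. The variable names are illustrative and are not taken from bind_tf_idf().

```r
# Assumption: idf(term) = ln(number of documents / number of documents containing the term)
n_docs_with_term <- 1305   # "the" appears in every document
n_docs_distinct  <- 1305   # document count reported via distinct()
n_docs_tapply    <- 1304   # document count reported via tapply()

# With the distinct() count the ratio is exactly 1, so the log is 0
idf_expected <- log(n_docs_distinct / n_docs_with_term)

# With the tapply() count the ratio drops below 1, so the log is negative
idf_observed <- log(n_docs_tapply / n_docs_with_term)

print(c(expected = idf_expected, observed = idf_observed))
```

Under these assumptions, a single missing document in the numerator is enough to push the idf of a term that occurs everywhere slightly below zero.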

Can you confirm the issue? Am I missing something? Why does tapply() "skip" a group? Why is the number of documents in bind_tf_idf() computed via tapply() instead of, for example, using distinct()?

Here you can find the csv data.

times_ocr=80_100&date=1945-01-01_2010-12-31&query=european_union&category=News.csv

... and obviously this happened because there was an NA in the issue column of the data frame. tapply() ignored it when grouping, but if you count the number of rows after applying distinct(), the NA is still there. Pardon.
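
The behaviour described above can be reproduced with a small base R sketch. The toy data frame below is made up for illustration and only stands in for issue_words. unique() is used here in place of distinct(); both keep NA as a value of its own.

```r
# Toy stand-in for issue_words; the values are invented for illustration
toy_words <- data.frame(
  issue = c("a", "a", "b", NA),
  word  = c("the", "union", "the", "the"),
  n     = c(3, 1, 2, 5)
)

# unique() keeps NA as a separate value, so the NA row counts as a document
n_docs_unique <- length(unique(toy_words$issue))

# tapply() converts the grouping vector to a factor, which excludes NA,
# so the NA group is silently dropped from the result
n_docs_tapply <- length(tapply(toy_words$n, toy_words$issue, sum))

# Dropping rows with a missing document id makes the two counts agree
toy_clean    <- toy_words[!is.na(toy_words$issue), ]
n_docs_clean <- length(unique(toy_clean$issue))

print(c(unique = n_docs_unique, tapply = n_docs_tapply, clean = n_docs_clean))
```

In the workflow above, the equivalent step would presumably be to filter data_df with filter(!is.na(issue)) before unnest_tokens(), so that the document count and the per-term counts are based on the same rows.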

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.