juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

Parts_of_speech Expected Behavior

nabsiddiqui opened this issue

The parts_of_speech tibble seems to work differently than I believe most users would expect. When you use a left_join on this tibble, as you do for sentiment analysis, some words have multiple entries when they should ideally have only the most commonly tagged part of speech.

For instance, the word "feel" in the "data-raw/mobyposi.i.zip" file shows up only once, tagged "VtiN", but it is recorded in the parts_of_speech list as four entries: one for Verb (usu participle), another for Verb (transitive), another for Verb (intransitive), and a last one for Noun. If you do a join, the data becomes inflated. You can call distinct() on the word column, but it isn't clear which entry for the word to use. Is the first one the most likely tag? There is no documentation making this clear.
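A minimal sketch of the inflation described above. It assumes dplyr is installed and uses a toy stand-in (`pos_toy`, a hypothetical name) holding only the four "feel" rows mentioned here, not the full parts_of_speech tibble; `tokens` is likewise made-up example data.

```r
library(dplyr)

# Toy stand-in for parts_of_speech, limited to the four "feel" rows
# described in this issue. The real tibble covers many more words.
pos_toy <- tibble(
  word = rep("feel", 4),
  pos = c("Verb (usu participle)", "Verb (transitive)",
          "Verb (intransitive)", "Noun")
)

# Hypothetical tokenized text, one row per word.
tokens <- tibble(word = c("i", "feel", "fine"))

# Each token row is repeated once per matching part-of-speech row.
joined <- left_join(tokens, pos_toy, by = "word")

nrow(tokens)  # 3
nrow(joined)  # 6: "feel" now occupies four rows
```

Any per-word count computed after such a join would count "feel" four times unless the lookup table is reduced to one row per word first.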

The updated version from the SUBTLEXus site contains a column for the dominant part of speech: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/. I believe most users expect the parts_of_speech list to contain the word and this column.

Thanks for this discussion @nabsiddiqui! You may notice that we point folks toward the SUBTLEXus dataset in the docs for parts_of_speech to make sure people know about it and can use it if appropriate. That dataset, at least as of when I added those docs, does not have an open source license and they specify it is only for non-commercial use. This makes it challenging and/or impossible (depending on who you ask) to include it in an open source package.

Thank you so much for the response @juliasilge. Sorry to be such a nuisance. I'm actually working on a book, to be in contract with Springer soon hopefully, looking at Tidy Data and Cultural Analytics. The section on text analysis uses tidytext.

I'm aware of the SUBTLEXus dataset but was hoping to spare readers that process, as I am also unsure of the copyright implications of including it in the book. For now, I will guide readers to use a distinct() call or slice to get the first value. I will try to update the documentation later on to make a note of this and will send a pull request when it's done.
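A rough sketch of that workaround, assuming dplyr is installed. `pos_toy` is a hypothetical toy table (the "whale" row is invented for illustration), and nothing in this thread confirms that the first row for a word is its most common tag, so this only keeps whichever entry happens to be listed first.

```r
library(dplyr)

# Toy stand-in for parts_of_speech; the "whale" row is made up.
pos_toy <- tibble(
  word = c("feel", "feel", "feel", "feel", "whale"),
  pos = c("Verb (usu participle)", "Verb (transitive)",
          "Verb (intransitive)", "Noun", "Noun")
)

# Option 1: keep the first row seen for each word.
pos_first <- distinct(pos_toy, word, .keep_all = TRUE)

# Option 2: the same result with group_by() and slice().
pos_first_slice <- pos_toy %>%
  group_by(word) %>%
  slice(1) %>%
  ungroup()

# Joining against the reduced table no longer adds rows.
tokens <- tibble(word = c("i", "feel", "fine"))
joined <- left_join(tokens, pos_first, by = "word")
nrow(joined)  # 3
```

Both options depend on the row order of the lookup table, so the chosen tag is arbitrary from a linguistic point of view.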

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.