EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing them in packages

Home Page: https://emilhvitfeldt.github.io/textdata/

tidytext data

EmilHvitfeldt opened this issue

  • nma_words
  • parts_of_speech
  • sentiments
    • nrc
    • bing
    • loughran
    • AFINN
  • stop_words
    • onix
    • SMART
    • snowball

@juliasilge now is the time to suggest more datasets if you want 😄 I know there has been interest earlier.

TBH I think NRC is going to have to be out of scope. I am still waiting on an email, but it really does sound like the creator does not want this data redistributed at all.

I also think that stop words do not need to be in scope because of the excellent stopwords package, which tidytext depends on.

I am not sure the parts of speech dataset is worth spending time on because using this kind of unigram, tidy data approach performs quite poorly for POS tagging. You really do need a deep learning or otherwise more complex approach, such as that implemented in cleanNLP. I don't hear about anybody using this dataset really; I may just deprecate it, although I don't see a significant problem with the license either.

Perfect, everything should be in order now.