tidytext data

EmilHvitfeldt opened this issue 5 years ago

Emil Hvitfeldt commented 5 years ago

~~nma_words~~
~~parts_of_speech~~
sentiments
- ~~nrc~~
- bing
- loughran
- AFINN
~~stop_words~~
- ~~onix~~
- ~~SMART~~
- ~~snowball~~

Emil Hvitfeldt · Answer 1 · Wed Jun 05 2019 01:53:33 GMT+0800 (China Standard Time)

@juliasilge now is the time to suggest more datasets if you want 😄 I know there has been interest earlier.

Julia Silge · Answer 2 · Wed Jun 05 2019 09:41:21 GMT+0800 (China Standard Time)

TBH I think NRC is going to have to be out of scope. I am still waiting on an email, but the creator really does sound like he does not want this data redistributed at all.

I also think that stop words do not need to be in scope because of the excellent stopwords package, which tidytext depends on.

I am not sure the parts of speech dataset is worth spending time on because using this kind of unigram, tidy data approach performs quite poorly for POS tagging. You really do need a deep learning or otherwise more complex approach, such as that implemented in cleanNLP. I don't hear about anybody using this dataset really; I may just deprecate it, although I don't see a significant problem with the license either.

Emil Hvitfeldt · Answer 3 · Sat Jun 08 2019 03:35:22 GMT+0800 (China Standard Time)

Perfect, everything should be in order now.