Weibo senti 100k is very likely labelled by the emoticons

Question

ThiagoSousa opened this issue 6 years ago

Thiago de Sousa Silveira commented 6 years ago

I downloaded this dataset (ChineseNlpCorpus/datasets/weibo_senti_100k) to train a model for Chinese sentiment analysis. While preprocessing it, I observed that 100% of the posts contain emoticons. Here is the distribution of the top 10 emoticons overall and within the positive and negative sets:

1013 emoticons in total. They are: [('泪', 44489), ('哈哈', 40510), ('嘻嘻', 22370), ('抓狂', 17262), ('鼓掌', 15923), ('爱你', 12685), ('怒', 12011), ('衰', 10466), ('晕', 9440), ('偷笑', 8375)]

710 emoticons in the positive set. They are: [('哈哈', 35764), ('嘻嘻', 20115), ('鼓掌', 14836), ('爱你', 11349), ('偷笑', 5223), ('太开心', 3820), ('可爱', 3809), ('心', 2122), ('赞', 1991), ('给力', 1976)]

695 emoticons in the negative set. They are: [('泪', 43248), ('抓狂', 16643), ('怒', 11830), ('衰', 10202), ('晕', 9022), ('哈哈', 4746), ('偷笑', 3152), ('蜡烛', 2887), ('汗', 2456), ('嘻嘻', 2255)]
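For anyone who wants to reproduce counts like the ones above, here is a minimal sketch. It assumes emoticons appear in the text as short bracketed tokens such as `[泪]` and that each row is a (label, text) pair with 1 = positive and 0 = negative. Neither assumption is confirmed in this thread, and `count_emoticons` and the sample rows are hypothetical names and data made up for illustration.

```python
import re
from collections import Counter

# Assumption: an emoticon is a short bracketed token, e.g. "[泪]" or "[哈哈]".
EMOTICON_RE = re.compile(r"\[([^\[\]\s]{1,8})\]")

def count_emoticons(rows):
    """Count emoticon names overall and per label.

    rows: iterable of (label, text) pairs, label 1 = positive, 0 = negative.
    """
    total = Counter()
    by_label = {0: Counter(), 1: Counter()}
    for label, text in rows:
        names = EMOTICON_RE.findall(text)
        total.update(names)
        by_label[label].update(names)
    return total, by_label

# Made-up rows for illustration only, not taken from the dataset.
sample = [
    (1, "今天真开心[哈哈][哈哈]"),
    (0, "又加班了[泪]"),
    (0, "无语[哈哈][抓狂]"),
]
total, by_label = count_emoticons(sample)
print(len(total), "emoticons in total:", total.most_common(10))
print(len(by_label[1]), "in the positive set:", by_label[1].most_common(10))
print(len(by_label[0]), "in the negative set:", by_label[0].most_common(10))
```

To run this on the real file, the rows would have to be read from the CSV first (for example with the standard `csv` module), using whatever column names the file actually has.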

I trained a very simple model to classify the posts and obtained 98% accuracy in 2 epochs, which suggests the emoticons strongly bias the classification. This led me to conclude that the dataset was not manually annotated post by post. Whoever built it probably classified some frequent emoticons by hand and then used them to tag the posts. Just a heads-up for anyone who wants to use this data: you'll probably want to clean the emoticons out of it to avoid this bias.
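One way to do that cleaning, as a rough sketch: it again assumes emoticons are written as short bracketed tokens like `[泪]`, which is an assumption about the file format rather than something stated in this thread. `strip_emoticons` is a hypothetical helper name.

```python
import re

# Assumption: an emoticon is a short bracketed token, e.g. "[泪]" or "[抓狂]".
EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")

def strip_emoticons(text):
    """Remove bracketed emoticon tokens and tidy up leftover whitespace."""
    cleaned = EMOTICON_RE.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

# Made-up example post, not taken from the dataset.
print(strip_emoticons("又加班了[泪] 好累[抓狂]"))
```

Note that this pattern also removes any other short bracketed text that is not an emoticon, and that some posts may end up empty after cleaning, so it is worth checking for and dropping empty rows before training.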

Peace!

En Ouyang · Answer 1 · Fri Jan 04 2019 14:12:22 GMT+0800 (China Standard Time)

lol, the findings are really interesting! @ThiagoSousa

jinhuakst · Answer 2 · Tue Jan 15 2019 19:59:44 GMT+0800 (China Standard Time)

@ThiagoSousa Yeah. Thank you for your comments.

xixi · Answer 3 · Fri Aug 02 2019 15:45:22 GMT+0800 (China Standard Time)

thx for your work.

easywaytodo · Answer 4 · Thu Oct 24 2019 14:48:58 GMT+0800 (China Standard Time)

Could I use it with BERT, and how should I preprocess the data? Are the emoticons out of vocabulary?