SophonPlus / ChineseNlpCorpus

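Collecting, organizing, and publishing Chinese natural language processing corpora and datasets, and working with like-minded people to advance Chinese natural language processing.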

Weibo senti 100k is very likely labelled by the emoticons

ThiagoSousa opened this issue

I downloaded this dataset (ChineseNlpCorpus/datasets/weibo_senti_100k) to train a model for Chinese sentiment analysis. While processing it, I observed that 100% of the posts contain emoticons. Here is the distribution of the top 10 emoticons overall and for the positive and negative polarity:

1013 emoticons in total. They are: [('泪', 44489), ('哈哈', 40510), ('嘻嘻', 22370), ('抓狂', 17262), ('鼓掌', 15923), ('爱你', 12685), ('怒', 12011), ('衰', 10466), ('晕', 9440), ('偷笑', 8375)]

710 emoticons in the positive set. They are: [('哈哈', 35764), ('嘻嘻', 20115), ('鼓掌', 14836), ('爱你', 11349), ('偷笑', 5223), ('太开心', 3820), ('可爱', 3809), ('心', 2122), ('赞', 1991), ('给力', 1976)]

695 emoticons in the negative set. They are: [('泪', 43248), ('抓狂', 16643), ('怒', 11830), ('衰', 10202), ('晕', 9022), ('哈哈', 4746), ('偷笑', 3152), ('蜡烛', 2887), ('汗', 2456), ('嘻嘻', 2255)]
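For anyone who wants to check counts like these on their own copy, here is a minimal sketch of one way to tally emoticons per label. It rests on two assumptions that are not stated above: that emoticons appear in the text as short bracketed tokens such as `[泪]`, and that each row is a `(label, text)` pair with 1 for positive and 0 for negative. The function name `emoticon_counts` and the sample rows are made up for illustration.

```python
import re
from collections import Counter

# Assumption: emoticons are written as short bracketed tokens, e.g. "[泪]".
EMOTICON_RE = re.compile(r"\[([^\[\]\s]{1,8})\]")

def emoticon_counts(rows):
    """Map each label to a Counter of the emoticon names found in its posts."""
    counts = {}
    for label, text in rows:
        counts.setdefault(label, Counter()).update(EMOTICON_RE.findall(text))
    return counts

# Hypothetical sample rows, not taken from the dataset.
sample = [
    (1, "今天天气真好[哈哈][哈哈]"),
    (0, "又加班[泪]"),
    (0, "堵车[抓狂][泪]"),
]

counts = emoticon_counts(sample)
print("positive:", counts[1].most_common(10))
print("negative:", counts[0].most_common(10))
```

If the real file uses a different emoticon markup, only the regular expression should need to change.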

I trained a very simple classifier and obtained 98% accuracy in 2 epochs, so the emoticons strongly bias the classification. This led me to conclude that the dataset was not manually annotated. Most likely, whoever built it manually assigned a polarity to some frequent emoticons and then used those emoticons to tag the posts. Just a heads-up for anyone who wants to use this data: you will probably want to strip the emoticons out of it to avoid this bias.
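A minimal sketch of that cleaning step, assuming the emoticons show up as short bracketed tokens such as `[泪]` (an assumption about the raw text, not something confirmed above). The helper name `strip_emoticons` is hypothetical.

```python
import re

# Assumption: emoticons are written as short bracketed tokens, e.g. "[泪]".
EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")

def strip_emoticons(text):
    """Remove bracketed emoticon tokens and trim surrounding whitespace."""
    return EMOTICON_RE.sub("", text).strip()

# Hypothetical examples, not taken from the dataset.
print(strip_emoticons("又加班[泪]"))
print(strip_emoticons("[哈哈]好开心[嘻嘻]"))
```

Note that this also removes any other short bracketed text, so it is worth spot-checking a few cleaned posts, and posts that consist only of emoticons will come out empty and may need to be dropped.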

Peace!

lol, the findings are really interesting! @ThiagoSousa

@ThiagoSousa Yeah. Thank you for your comments.

thx for your work.

Could I use it with BERT, and how should I preprocess the data? Are the emoticons out of vocabulary?