Arabic Fake News corpora:
The following are two Arabic corpora for the task of fake news detection:
- Manually Annotated Corpus:
The annotation process yielded a corpus of 1,537 tweets (835 fake and 702 genuine), after excluding duplicate tweets, tweets containing a mix of fake and genuine news, and tweets where the fake news was intended as sarcasm. Statistical information about the manually annotated corpus is shown in the following table:
| | Fake Tweets | Not Fake Tweets |
|---|---|---|
Total Tweets | 835 | 702 |
Total Words | 20,395 | 19,852 |
Unique Words | 6,246 | 7,115 |
Total Characters | 117,630 | 113,121 |
- Automatically Annotated Corpus:
We trained several machine learning classifiers on the manually annotated corpus and used the best-performing classifier to automatically predict the fake news class of the remaining unlabeled tweets. The prediction process yielded 34,529 tweets (19,582 fake and 14,947 genuine), as shown in the following table.
| | Fake Tweets | Not Fake Tweets |
|---|---|---|
Total Tweets | 19,582 | 14,947 |
Total Words | 479,349 | 463,768 |
Unique Words | 79,383 | 88,037 |
Total Characters | 2,855,454 | 2,680,067 |
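The automatic annotation step described above can be sketched as follows: fit a classifier on the manually labeled tweets, then use its predictions as labels for the unlabeled pool. The tiny in-memory dataset and the TF-IDF + Logistic Regression pipeline below are illustrative assumptions, not the exact setup used to build the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder manually labeled corpus (1 = fake, 0 = not fake);
# the real corpus contains 1,537 annotated tweets.
labeled_texts = [
    "breaking shocking miracle cure revealed",
    "official statement issued by the ministry",
    "secret remedy exposed share before deleted",
    "ministry confirms report in press briefing",
]
labels = [1, 0, 1, 0]

# Unlabeled tweets awaiting automatic annotation.
unlabeled_texts = [
    "shocking miracle claim revealed",
    "official press report confirms statement",
]

# Fit the (assumed best-performing) classifier on the labeled data.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(labeled_texts, labels)

# Predicted classes become the automatic annotations.
auto_labels = clf.predict(unlabeled_texts)
print(list(auto_labels))
```

In the corpus construction, the same idea was applied at scale, producing the 34,529 automatically labeled tweets summarized in the table above.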
Machine Learning Classifiers:
Six machine learning classifiers were used to perform fake news classification on both datasets: Naïve Bayes (NB) [19], Logistic Regression (LR), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Random Forest bagging model (RF), and eXtreme Gradient Boosting model (XGB). The following hyper-parameters were used with each classifier:
• NB: alpha=0.5
• LR: with default values
• SVM: C=1.0, kernel=linear, gamma=3
• MLP: activation function=ReLU, maximum iterations=30, learning rate=0.1
• RF: with default values
• XGB: with default values
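The list above can be instantiated in scikit-learn roughly as follows. This is a minimal sketch under stated assumptions: the mapping of "default values" to library defaults, the use of MultinomialNB for NB, and `learning_rate_init` as the MLP learning rate are interpretations on my part, and XGBoost is treated as an optional dependency.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "NB": MultinomialNB(alpha=0.5),
    "LR": LogisticRegression(),  # default values
    # Note: scikit-learn ignores gamma when kernel="linear".
    "SVM": SVC(C=1.0, kernel="linear", gamma=3),
    "MLP": MLPClassifier(activation="relu", max_iter=30,
                         learning_rate_init=0.1),
    "RF": RandomForestClassifier(),  # default values
}

try:
    # xgboost is a separate package; skip XGB if it is not installed.
    from xgboost import XGBClassifier
    classifiers["XGB"] = XGBClassifier()  # default values
except ImportError:
    pass
```

Each entry can then be fitted and evaluated on the two corpora with the usual `fit`/`predict` interface.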