MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark
MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. For more details, please refer to the MGTAB paper.
Stance label | Count | Percentage (%) | Bot label | Count | Percentage (%) |
---|---|---|---|---|---|
neutral | 3,776 | 37.02 | human | 7,451 | 73.06 |
against | 3,637 | 35.66 | bot | 2,748 | 26.94 |
support | 2,786 | 27.32 | | | |
Our proposed dataset has seven types of user relationships.
Dataset (number of edges) | followers | friends | mention | reply | quoted | URL | hashtag |
---|---|---|---|---|---|---|---|
MGTAB | 308,120 | 412,575 | 114,516 | 223,466 | 77,631 | 263,800 | 300,000 |
MGTAB-large | 31,990,488 | 49,668,723 | 7,135,192 | 1,018,834 | 182,296 | 51,281 | 7,950,896 |
python 3.7
scikit-learn 1.0.2
torch 1.8.1+cu111
torch_cluster 1.5.9
torch_scatter 2.0.6
torch_sparse 0.6.9
torch_spline_conv 1.2.1
torch-geometric 2.0.4
pytorch-lightning 1.5.0
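Because the compiled PyTorch Geometric extensions are sensitive to version mismatches, it can help to compare the installed packages against the pins above before training. The following is a minimal sketch, not part of the MGTAB code: the helper names `installed_version` and `find_mismatches` are hypothetical, and the dictionary keys are assumed to be the pip distribution names of the listed packages.

```python
# Pinned versions copied from the requirements list above.
# The keys are ASSUMED pip distribution names, not taken from the repository.
PINNED = {
    "scikit-learn": "1.0.2",
    "torch": "1.8.1+cu111",
    "torch-cluster": "1.5.9",
    "torch-scatter": "2.0.6",
    "torch-sparse": "0.6.9",
    "torch-spline-conv": "1.2.1",
    "torch-geometric": "2.0.4",
    "pytorch-lightning": "1.5.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python >= 3.8
    except ImportError:
        import pkg_resources  # fallback for Python 3.7

        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package matches its pinned version; the Python interpreter version itself is not checked here.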
To start the training process:
Train GNN models
python MGTAB-GNN.py --task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4
python MGTAB-GNN.py --task bot --model RGCN --relation_select 0 1 --random_seed 0 1 2 3 4
Train Machine Learning models
python MGTAB-ML.py --task stance --models_list 1 2 3 --random_seed 0 1 2 3 4
python MGTAB-ML.py --task bot --models_list 4 5 6 7 --random_seed 0 1 2 3 4
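The commands above pass several values to flags such as `--relation_select` and `--random_seed`. As an illustration only, here is a minimal sketch of how a script could accept such multi-valued flags with `argparse` and run once per seed. It is an assumption about the interface, not the actual contents of `MGTAB-GNN.py` or `MGTAB-ML.py`; `build_parser`, `run_once`, and the default values are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags used in the commands above (hypothetical)."""
    parser = argparse.ArgumentParser(description="Hypothetical MGTAB-style training CLI")
    parser.add_argument("--task", choices=["stance", "bot"], default="bot")
    parser.add_argument("--model", default="GCN")
    # nargs="+" collects one or more space-separated integers into a list.
    parser.add_argument("--relation_select", type=int, nargs="+", default=[0, 1])
    parser.add_argument("--random_seed", type=int, nargs="+", default=[0, 1, 2, 3, 4])
    return parser


def run_once(task, model, relations, seed):
    """Placeholder for a single training run; returns a description instead of training."""
    return f"task={task} model={model} relations={relations} seed={seed}"


def main(argv):
    """Parse the arguments and perform one (placeholder) run per random seed."""
    args = build_parser().parse_args(argv)
    return [
        run_once(args.task, args.model, args.relation_select, seed)
        for seed in args.random_seed
    ]


# Example invocation with an explicit argument list instead of sys.argv.
for line in main(["--task", "stance", "--model", "GCN",
                  "--relation_select", "0", "1", "--random_seed", "0", "1"]):
    print(line)
```

Under this reading, passing five seeds corresponds to five independent runs of the same configuration.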
Train GNN models in parallel on multiple GPUs
python GNN_sample_large.py --task bot --relation_select 0 1 2 3 4 5 6 --model RGT --GPU_num 4
python GNN_sample_large.py --task bot --relation_select 0 1 2 3 4 --model SHGN --GPU_num 4
python GNN_sample_large.py --task stance --relation_select 0 1 --model GCN --GPU_num 4
python GNN_sample_large.py --task stance --relation_select 0 --model GAT --GPU_num 4
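The `--relation_select` flag restricts training to a subset of the seven relation types. The sketch below shows one way such a selection could be applied to a typed edge list in plain Python. It rests on two assumptions that are not confirmed by the repository: that the indices 0 to 6 follow the column order of the edge table (followers, friends, mention, reply, quoted, URL, hashtag), and that the kept relation ids are renumbered contiguously. `select_relations` is a hypothetical helper.

```python
# ASSUMED index-to-name mapping, taken from the column order of the edge table.
# The actual scripts may number the relations differently.
RELATION_NAMES = ["followers", "friends", "mention", "reply", "quoted", "URL", "hashtag"]


def select_relations(edges, edge_types, relation_select):
    """Keep only edges of the selected types and renumber the kept types 0..k-1.

    edges:           list of (source, target) node-id pairs
    edge_types:      list of relation indices, one per edge
    relation_select: iterable of relation indices to keep (duplicates are ignored)
    """
    selected = sorted(set(relation_select))
    for relation in selected:
        if not 0 <= relation < len(RELATION_NAMES):
            raise ValueError(f"unknown relation index: {relation}")
    # Map the original relation ids onto a contiguous range.
    remap = {old: new for new, old in enumerate(selected)}
    kept_edges, kept_types = [], []
    for edge, edge_type in zip(edges, edge_types):
        if edge_type in remap:
            kept_edges.append(edge)
            kept_types.append(remap[edge_type])
    return kept_edges, kept_types


edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
edge_types = [0, 2, 6, 2]  # followers, mention, hashtag, mention (under the assumed mapping)
kept_edges, kept_types = select_relations(edges, edge_types, [2, 6])
print(kept_edges)  # [(1, 2), (2, 0), (0, 3)]
print(kept_types)  # [0, 1, 0]
```

Renumbering keeps the relation ids dense, which is convenient for relational layers that allocate one weight matrix per relation type.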
methods | type | accuracy | precision | recall | f1-score |
---|---|---|---|---|---|
AdaBoost | F | 74.59 | 74.60 | 74.02 | 73.88 |
Random Forest | F | 79.62 | 80.04 | 78.83 | 79.04 |
Decision Tree | F | 66.92 | 66.34 | 66.23 | 66.03 |
SVM | F | 81.23 | 81.40 | 80.86 | 80.71 |
KNN | F | 76.25 | 75.54 | 75.70 | 75.48 |
Logistic Regression | F | 79.51 | 79.33 | 78.83 | 78.98 |
GCN | G | 81.35 | 81.08 | 80.19 | 80.08 |
GraphSAGE | G | 83.33 | 82.52 | 83.45 | 82.72 |
GAT | G | 82.19 | 81.72 | 81.68 | 81.04 |
HGT | G | 83.29 | 81.63 | 81.51 | 81.82 |
S-HGN | G | 85.32 | 83.93 | 83.65 | 84.42 |
BotRGCN | G | 84.71 | 83.43 | 84.08 | 84.30 |
RGT | G | 87.78 | 85.22 | 84.40 | 86.86 |
methods | type | accuracy | precision | recall | f1-score |
---|---|---|---|---|---|
AdaBoost | F | 90.12 | 88.51 | 89.10 | 87.71 |
Random Forest | F | 89.52 | 88.92 | 86.72 | 86.83 |
Decision Tree | F | 87.13 | 83.81 | 83.39 | 83.70 |
SVM | F | 88.68 | 85.73 | 85.73 | 85.31 |
KNN | F | 85.78 | 82.28 | 80.49 | 81.28 |
Logistic Regression | F | 88.49 | 85.69 | 84.41 | 84.97 |
GCN | G | 85.81 | 77.40 | 84.37 | 78.33 |
GraphSAGE | G | 88.71 | 85.33 | 86.15 | 85.44 |
GAT | G | 86.96 | 79.71 | 84.88 | 82.33 |
HGT | G | 90.28 | 85.35 | 85.97 | 87.52 |
S-HGN | G | 91.42 | 87.40 | 86.73 | 88.72 |
BotRGCN | G | 89.60 | 85.21 | 87.07 | 87.16 |
RGT | G | 92.12 | 88.08 | 86.64 | 90.41 |
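For readers who want to compute the same four kinds of scores for their own predictions, the sketch below derives accuracy, precision, recall, and F1 from label lists in plain Python. Macro averaging over classes is an assumption here; this document does not state which averaging the reported numbers use, and `macro_metrics` is a hypothetical helper rather than part of the MGTAB code.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1, in percent."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no true samples) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n = len(labels)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * sum(precisions) / n,
        "recall": 100 * sum(recalls) / n,
        "f1-score": 100 * sum(f1_scores) / n,
    }


scores = macro_metrics(
    ["bot", "bot", "human", "human"],
    ["bot", "human", "human", "human"],
)
print({name: round(value, 2) for name, value in scores.items()})
```

With per-class averaging of this kind, the macro F1 is the mean of the per-class F1 values rather than the harmonic mean of the macro precision and macro recall.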
The MGTAB dataset is released under the CC BY-NC-ND 4.0 license. The code in the MGTAB evaluation framework is released under the MIT license.
For SemEval-2016 T6, visit the SemEval2016 repository. For SemEval-2019 T7, visit the SemEval2019 GitHub repository. For TwiBot-20, visit the TwiBot-20 GitHub repository. For TwiBot-22, visit the TwiBot-22 GitHub repository. For other bot detection datasets, please visit the Bot Repository.
MGTAB is available on Google Drive. MGTAB-large (which contains 400,000 unlabeled users) is available on Google Drive. We also offer the standardized Cresci-15 on Google Drive. After downloading these datasets, please unzip them into the "./Dataset" directory.
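The unzip step can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function name `extract_dataset` and the archive file name in the usage comment are hypothetical, since the names of the downloaded archives are not given here.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir="./Dataset"):
    """Extract a downloaded archive into target_dir and return the extracted member names."""
    target = Path(target_dir)
    # Create ./Dataset (and any missing parents) if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


# Usage (the archive name is a placeholder for whatever file was downloaded):
# members = extract_dataset("MGTAB.zip")
# print(f"extracted {len(members)} entries into ./Dataset")
```

Depending on how an archive was packed, its contents may land in a subfolder of `./Dataset`, so it is worth checking the returned member names against the paths the training scripts expect.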