bot-detection dataset gnn-algorithm stance-detection

MGTAB

MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark

Introduction

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. For more details, please refer to the MGTAB paper.

Distribution of labels in annotations.

Stance			Bot
Label	Count	Percentage (%)	Label	Count	Percentage (%)
neutral	3,776	37.02	human	7,451	73.06
against	3,637	35.66	bot	2,748	26.94
support	2,786	27.32
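
The percentages in the table follow directly from the counts over the 10,199 annotated users. A minimal sketch that reproduces them is shown here; the dictionary and function names are illustrative and are not part of the MGTAB code.

```python
# Label counts copied from the table above.
STANCE_COUNTS = {"neutral": 3776, "against": 3637, "support": 2786}
BOT_COUNTS = {"human": 7451, "bot": 2748}


def label_percentages(counts):
    """Return each label's share of the total, in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


if __name__ == "__main__":
    print(label_percentages(STANCE_COUNTS))
    print(label_percentages(BOT_COUNTS))
```

Both label sets sum to the same 10,199 users, since every annotated account carries a stance label and a bot label.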

MGTAB contains 10,199 expert-annotated users. MGTAB-large extends MGTAB with 400,000 additional unlabeled users.

Multiple relations in MGTAB.

Our proposed dataset has seven types of user relationships.

MGTAB
Edge type	followers	friends	mention	reply	quoted	URL	hashtag
Numbers	308,120	412,575	114,516	223,466	77,631	263,800	300,000
MGTAB-large
Edge type	followers	friends	mention	reply	quoted	URL	hashtag
Numbers	31,990,488	49,668,723	7,135,192	1,018,834	182,296	51,281	7,950,896
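
The training commands in this README pick a subset of these relations with the --relation_select flag, which takes integer indices. The sketch below shows what such a selection could look like on a plain edge list. It assumes that indices 0 to 6 follow the column order of the tables above (followers = 0, ..., hashtag = 6); that mapping, the function name select_relations, and the toy edges are illustrative assumptions, not taken from the MGTAB code.

```python
# Assumed index order: the column order of the relation tables above.
RELATIONS = ["followers", "friends", "mention", "reply", "quoted", "URL", "hashtag"]


def select_relations(edges, edge_types, selected):
    """Keep only edges whose relation index is in `selected`.

    edges:      list of (source, target) user-index pairs
    edge_types: list of relation indices, one per edge
    selected:   iterable of relation indices to keep
    Returns the kept edges and their types re-indexed to 0..k-1.
    """
    keep = sorted(set(selected))
    remap = {old: new for new, old in enumerate(keep)}
    out_edges, out_types = [], []
    for edge, rel in zip(edges, edge_types):
        if rel in remap:
            out_edges.append(edge)
            out_types.append(remap[rel])
    return out_edges, out_types


if __name__ == "__main__":
    # Toy graph with four users; this is not real MGTAB data.
    toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    toy_types = [0, 1, 2, 1, 6]
    print(select_relations(toy_edges, toy_types, [0, 1]))
```

Re-indexing the kept relation types to a dense range is a common convenience for relational GNN layers, which usually expect type ids from 0 to the number of relations minus one.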

Environment

python 3.7
scikit-learn 1.0.2
torch 1.8.1+cu111
torch_cluster 1.5.9
torch_scatter 2.0.6
torch_sparse 0.6.9
torch_spline_conv 1.2.1
torch-geometric 2.0.4
pytorch-lightning 1.5.0

Train Model

To start the training process:

Train GNN models

python MGTAB-GNN.py  --task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4
python MGTAB-GNN.py  --task bot --model RGCN --relation_select 0 1 --random_seed 0 1 2 3 4
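
The flags above take space-separated lists of integers. The following is a minimal sketch of how flags of this shape could be parsed with argparse. It is an illustration only: the function build_parser, the defaults, and the allowed choices are assumptions and do not describe the actual argument parser in MGTAB-GNN.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Illustrative MGTAB-style CLI")
    parser.add_argument("--task", choices=["stance", "bot"], required=True)
    parser.add_argument("--model", type=str, required=True)
    # nargs="+" collects one or more space-separated integers into a list.
    parser.add_argument("--relation_select", type=int, nargs="+", default=[0, 1])
    parser.add_argument("--random_seed", type=int, nargs="+", default=[0])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        "--task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4".split()
    )
    # One run per seed, as suggested by the list of seeds on the command line.
    for seed in args.random_seed:
        print(f"task={args.task} model={args.model} "
              f"relations={args.relation_select} seed={seed}")
```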

Train Machine Learning models

python MGTAB-ML.py  --task stance --models_list 1 2 3  --random_seed 0 1 2 3 4
python MGTAB-ML.py  --task bot --models_list 4 5 6 7  --random_seed 0 1 2 3 4

Train GNN models in parallel on multiple GPUs

python GNN_sample_large.py  --task bot --relation_select 0 1 2 3 4 5 6 --model RGT --GPU_num 4
python GNN_sample_large.py  --task bot --relation_select 0 1 2 3 4 --model SHGN --GPU_num 4
python GNN_sample_large.py  --task stance --relation_select 0 1 --model GCN --GPU_num 4
python GNN_sample_large.py  --task stance --relation_select 0 --model GAT --GPU_num 4

Baseline performance

In the tables below, type F denotes feature-based methods and type G denotes graph-based methods.

Stance detection performance on MGTAB

methods	type	accuracy	precision	recall	f1-score
AdaBoost	F	74.59 $_{1.41}$	74.60 $_{1.35}$	74.02 $_{1.61}$	73.88 $_{1.47}$
Random Forest	F	79.62 $_{0.68}$	80.04 $_{0.43}$	78.83 $_{0.98}$	79.04 $_{0.82}$
Decision Tree	F	66.92 $_{0.93}$	66.34 $_{1.02}$	66.23 $_{1.06}$	66.03 $_{0.84}$
SVM	F	81.23 $_{0.66}$	81.40 $_{0.71}$	80.86 $_{1.09}$	80.71 $_{0.78}$
KNN	F	76.25 $_{1.32}$	75.54 $_{1.41}$	75.70 $_{1.37}$	75.48 $_{1.37}$
Logistic Regression	F	79.51 $_{1.00}$	79.33 $_{0.98}$	78.83 $_{1.17}$	78.98 $_{1.11}$
GCN	G	81.35 $_{0.58}$	81.08 $_{0.30}$	80.19 $_{0.56}$	80.08 $_{0.56}$
GraphSAGE	G	83.33 $_{1.22}$	82.52 $_{1.63}$	83.45 $_{0.63}$	82.72 $_{1.34}$
GAT	G	82.19 $_{1.23}$	81.72 $_{1.19}$	81.68 $_{1.16}$	81.04 $_{1.24}$
HGT	G	83.29 $_{0.44}$	81.63 $_{0.58}$	81.51 $_{0.76}$	81.82 $_{0.34}$
S-HGN	G	85.32 $_{0.53}$	83.93 $_{0.67}$	83.65 $_{0.65}$	84.42 $_{0.43}$
BotRGCN	G	84.71 $_{1.43}$	83.43 $_{1.23}$	84.08 $_{0.94}$	84.30 $_{1.44}$
RGT	G	87.78 $_{0.43}$	85.22 $_{0.89}$	84.40 $_{0.74}$	86.86 $_{0.43}$
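
For readers who want to compute the same four columns on their own predictions, here is a minimal sketch in plain Python. It assumes macro averaging over classes, which this README does not state, so the averaging scheme is an assumption; the function name macro_scores and the toy labels are illustrative only.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1) as fractions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Made-up labels for six users, for illustration only.
    y_true = ["neutral", "against", "support", "neutral", "against", "support"]
    y_pred = ["neutral", "against", "neutral", "neutral", "support", "support"]
    print(macro_scores(y_true, y_pred))
```

Macro averaging weights every class equally, which matters for the stance labels because the three classes are not evenly sized.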

Bot detection performance on MGTAB

methods	type	accuracy	precision	recall	f1-score
AdaBoost	F	90.12 $_{0.92}$	88.51 $_{1.33}$	89.10 $_{0.92}$	87.71 $_{1.10}$
Random Forest	F	89.52 $_{0.44}$	88.92 $_{0.49}$	86.72 $_{1.15}$	86.83 $_{0.53}$
Decision Tree	F	87.13 $_{0.51}$	83.81 $_{0.72}$	83.39 $_{1.06}$	83.70 $_{0.74}$
SVM	F	88.68 $_{1.40}$	85.73 $_{1.84}$	85.73 $_{1.84}$	85.31 $_{1.73}$
KNN	F	85.78 $_{0.84}$	82.28 $_{1.22}$	80.49 $_{0.64}$	81.28 $_{0.66}$
Logistic Regression	F	88.49 $_{1.31}$	85.69 $_{1.69}$	84.41 $_{1.96}$	84.97 $_{1.67}$
GCN	G	85.81 $_{1.32}$	77.40 $_{2.12}$	84.37 $_{1.73}$	78.33 $_{1.67}$
GraphSAGE	G	88.71 $_{1.24}$	85.33 $_{1.83}$	86.15 $_{2.55}$	85.44 $_{1.08}$
GAT	G	86.96 $_{1.28}$	79.71 $_{2.96}$	84.88 $_{1.52}$	82.33 $_{2.12}$
HGT	G	90.28 $_{0.29}$	85.35 $_{0.33}$	85.97 $_{0.41}$	87.52 $_{0.37}$
S-HGN	G	91.42 $_{0.43}$	87.40 $_{0.67}$	86.73 $_{0.64}$	88.72 $_{0.58}$
BotRGCN	G	89.60 $_{0.82}$	85.21 $_{1.81}$	87.07 $_{1.38}$	87.16 $_{0.74}$
RGT	G	92.12 $_{0.37}$	88.08 $_{0.43}$	86.64 $_{0.25}$	90.41 $_{0.47}$
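
Each cell in the tables above pairs a main value with a subscript. Assuming these are the mean and the standard deviation over runs with different random seeds (the README does not say so explicitly), a cell could be produced as in the sketch below. The helper name summarize and the toy scores are illustrative, and the choice of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format per-seed scores (in percent) as 'mean $_{std}$'.

    Uses the sample standard deviation; whether the tables use the sample
    or the population form is not stated, so treat this as an assumption.
    """
    return f"{mean(scores):.2f} $_{{{stdev(scores):.2f}}}$"


if __name__ == "__main__":
    # Made-up accuracies for five seeds, for illustration only.
    toy_accuracy = [80.0, 81.0, 82.0, 81.0, 81.0]
    print(summarize(toy_accuracy))
```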

Licensing

The MGTAB dataset is released under the CC BY-NC-ND 4.0 license. The code in the MGTAB evaluation framework is released under the MIT license.

Datasets download

For SemEval-2016 T6, visit the SemEval2016 repository. For SemEval-2019 T7, visit the SemEval2019 GitHub repository. For TwiBot-20, visit the TwiBot-20 GitHub repository. For TwiBot-22, visit the TwiBot-22 GitHub repository. For other bot detection datasets, please visit the Bot Repository.

MGTAB is available at Google Drive. MGTAB-large (which contains 400,000 unlabeled users) is available at Google Drive. We also offer the standardized Cresci-15 at Google Drive. After downloading these datasets, please unzip them into the "./Dataset" directory.
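
The unzip step can also be scripted. The sketch below uses only the Python standard library; the archive name "MGTAB.zip" is a placeholder for whatever file name the download produces, and the helper extract_dataset is illustrative rather than part of the repository.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir="./Dataset"):
    """Extract a downloaded archive into target_dir.

    Creates target_dir if needed and returns the sorted member names
    stored in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


if __name__ == "__main__":
    # Placeholder file name; adjust to the actual downloaded archive.
    archive = Path("MGTAB.zip")
    if archive.exists():
        print(extract_dataset(archive))
    else:
        print(f"{archive} not found; download it first.")
```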
