
Rumour-detection

Requirements

python==3.8
torch==1.10.1
transformers==4.18.0
PyYAML==5.4.1
tqdm==4.62.3
pandas==1.2.5
numpy==1.21.2
scikit-learn==1.0.2
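With pip, the pinned versions can be installed in one command (assuming a Python 3.8 environment is already active):

pip install torch==1.10.1 transformers==4.18.0 PyYAML==5.4.1 tqdm==4.62.3 pandas==1.2.5 numpy==1.21.2 scikit-learn==1.0.2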

Data Pre-processing

Call process_tweepy_data.py and specify the data split (i.e., train, dev, test, or covid) to pre-process the tweet text, including source tweets and replies, and to extract statistical features from the raw data.

cd ./task1/pre-process
python process_tweepy_data.py -m train
python process_tweepy_data.py -m dev
python process_tweepy_data.py -m test
python process_tweepy_data.py -m covid

Call process_data.py to fill in missing values using KNN imputation and to apply standard scaling to the features.

cd ./task1/pre-process
python process_data.py

The resulting data will be stored as .csv files in the res folder.
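A minimal sketch of this step, assuming scikit-learn's KNNImputer and StandardScaler are used (the file paths and column selection here are placeholders, not the script's actual ones):

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Placeholder input: the statistical features extracted in the previous step.
df = pd.read_csv("res/train_features.csv")
numeric_cols = df.select_dtypes("number").columns

# Fill each missing value from the 5 nearest rows, then standardise to zero mean / unit variance.
values = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])
df[numeric_cols] = StandardScaler().fit_transform(values)

df.to_csv("res/train_features_scaled.csv", index=False)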

1. Pre-processing tweet text

  • Expand abbreviations (e.g., don't -> do not)
  • Remove punctuation, numbers, Twitter IDs, and URLs
  • Tokenization (i.e., convert sentences into lists of words)
  • Stemming (i.e., reduce each word to its root form)
  • Convert emojis into corresponding words using the emoji package in Python (a sketch of these steps follows this list)
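A minimal sketch of these pre-processing steps, assuming NLTK (not listed in the requirements above) for tokenization and stemming, the emoji package for emoji conversion, and a small illustrative contraction table:

import re
import emoji
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

CONTRACTIONS = {"don't": "do not", "can't": "can not"}  # illustrative subset

def preprocess(text):
    text = text.lower()
    for abbr, full in CONTRACTIONS.items():      # expand abbreviations
        text = text.replace(abbr, full)
    text = emoji.demojize(text)                  # emojis -> words like ":smiling_face:"
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove Twitter IDs
    text = re.sub(r"[^a-z\s]", " ", text)        # remove punctuation and numbers
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in word_tokenize(text)]  # tokenize + stem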

2. Extracting features from raw data

  • Tweet Content's Features
    • The number of question marks in the tweet text
    • The number of URLs mentioned in the tweet text
    • The number of Twitter IDs mentioned in the tweet text
    • The sentiment of the tweet text, computed with TextBlob in Python and tagged as Positive (i.e., 1) or Negative (i.e., 0); a sketch of these content features follows this list
  • Tweet Statistical Features
    • Reply Count: The number of replies to the tweet
    • Favorite Count: The number of likes the tweet received
  • User Features
    • Friends Count
    • Protected or not
    • Verified or not
    • User Engagement
    • Following Rate
    • Favourite Rate

Model training

1. BERTweet with only raw tweet text (TextBERTweet)

cd ./task1/main_models
python train.py -m textbertweet -l 128  # Note: 128 is the maximum input length for this model
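For reference, the BERTweet encoder can be loaded via transformers; vinai/bertweet-base is the standard Hugging Face checkpoint, though whether train.py uses exactly this name is an assumption:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
model = AutoModel.from_pretrained("vinai/bertweet-base")

# BERTweet's position embeddings cap usable input at 128 tokens, hence -l 128 above.
enc = tokenizer("example source tweet text", truncation=True, max_length=128,
                return_tensors="pt")
hidden = model(**enc).last_hidden_state  # shape: (1, seq_len, 768)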

2. BERTweet with raw tweet text and statistics features (TextBERTweet-TF)

cd ./task1/main_models
python train.py -m textbertweet_tf -l 128  # Note: 128 is the maximum input length for this model

3. Bert-Base with only pre-processed tweet text (TextBERT)

cd ./task1/main_models
python train.py -m textbert -l 256  # Note: this model supports up to 512

4. Bert-Base with pre-processed tweet text and statistics features (TextBERT-TF)

cd ./task1/main_models
python train.py -m textbert_tf -l 256  # Note: this model supports up to 512
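The "-TF" variants fuse the text representation with the statistical features. One plausible fusion is concatenating the [CLS] embedding with the feature vector before a linear head; this is a sketch, not necessarily the repo's exact architecture, and bert-base-uncased here is an assumption for the Bert-Base variants:

import torch
import torch.nn as nn
from transformers import AutoModel

class TextBertTF(nn.Module):
    def __init__(self, n_stat_features, pretrained="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size + n_stat_features, 2)

    def forward(self, input_ids, attention_mask, stat_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] embedding
        return self.classifier(torch.cat([cls, stat_features], dim=-1))  # rumour logits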

Model testing

Specify the model name (i.e., textbertweet / textbertweet_tf / textbert / textbert_tf), the maximum BERT input length, the test data split (i.e., test or covid), and the path to an existing trained model. For example, to predict rumour or non-rumour on the test data with the saved BERT-with-statistics model bertcls_16.dat:

cd ./task1/main_models
python test.py -m textbert_tf -l 256 -t test -e bertcls_16.dat
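Testing boils down to restoring the saved weights and taking an argmax over the logits. A rough sketch, reusing the hypothetical TextBertTF class from the training section above (the feature count and label mapping are assumptions):

import torch
from transformers import AutoTokenizer

model = TextBertTF(n_stat_features=10)  # hypothetical feature count
model.load_state_dict(torch.load("bertcls_16.dat", map_location="cpu"))
model.eval()

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("example tweet text", truncation=True, max_length=256, return_tensors="pt")
stats = torch.zeros(1, 10)  # placeholder statistical features
with torch.no_grad():
    logits = model(enc["input_ids"], enc["attention_mask"], stats)
pred = logits.argmax(dim=-1).item()  # assumed mapping: 1 = rumour, 0 = non-rumour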
