python==3.8
torch==1.10.1
transformers==4.18.0
PyYAML==5.4.1
tqdm==4.62.3
pandas==1.2.5
numpy==1.21.2
scikit-learn==1.0.2
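The pinned dependencies above can be installed with pip in one step (a sketch; note that python==3.8 refers to the interpreter version, not a pip package, so it is set up separately):

```shell
# Install the pinned dependencies. python==3.8 is the interpreter version,
# not a pip package, so provide it separately (e.g. via pyenv or conda).
pip install torch==1.10.1 transformers==4.18.0 PyYAML==5.4.1 tqdm==4.62.3 \
    pandas==1.2.5 numpy==1.21.2 scikit-learn==1.0.2
```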
Call process_tweepy_data.py and specify the data type (i.e., train, dev, test, or covid) to pre-process the tweet text (both source tweets and replies) and extract statistical features from the raw data.
cd ./task1/pre-process
python process_tweepy_data.py -m train
python process_tweepy_data.py -m dev
python process_tweepy_data.py -m test
python process_tweepy_data.py -m covid
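The `-m` flag above selects which split to pre-process. A minimal sketch of how such a flag could be parsed with argparse (the flag name and choices come from the commands above; the rest is an assumption about the script's internals):

```python
import argparse

# Sketch of the -m flag used by process_tweepy_data.py; the choices are the
# four data types the commands above pass (train / dev / test / covid).
parser = argparse.ArgumentParser(description="Pre-process tweepy data")
parser.add_argument("-m", "--mode", choices=["train", "dev", "test", "covid"],
                    required=True, help="which data split to pre-process")

args = parser.parse_args(["-m", "covid"])  # simulating: python process_tweepy_data.py -m covid
print(args.mode)  # covid
```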
Call process_data.py to fill in missing data with KNN interpolation and apply standard scaling.
cd ./task1/pre-process
python process_data.py
The resulting data will be stored as .csv files in the res folder.
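The two steps process_data.py performs can be sketched with scikit-learn (which is in the pinned requirements); the toy matrix and n_neighbors value are illustrative assumptions:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value (illustrative only).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Step 1: KNN interpolation — each NaN is replaced by the mean of the
# corresponding feature in the k nearest complete rows.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Step 2: standard scaling — each column is shifted/scaled to zero mean
# and unit variance.
X_scaled = StandardScaler().fit_transform(X_filled)

print(X_scaled.mean(axis=0))  # ~[0, 0]
```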
- Expand abbreviations (e.g., don't -> do not)
- Remove punctuation, numbers, Twitter IDs, and URLs
- Tokenization (i.e., convert sentences into lists of words)
- Stemming (i.e., reduce each inflected form of a word to its root form)
- Convert emojis into their corresponding words using the emoji package in Python
- Tweet Content's Features
- The number of question marks in the tweet text
- The number of URLs mentioned in the tweet text
- The number of Twitter IDs mentioned in the tweet text
- The sentiment of the tweet text, calculated with TextBlob in Python and tagged as Positive (i.e., 1) or Negative (i.e., 0)
- Tweet Statistical Features
- Reply Count: The number of replies of the tweet
- Favorite Count: The number of users who liked the tweet
- User Features
- Friends Count
- Protected or not
- Verified or not
- User Engagement
- Following Rate
- Favourite Rate
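The count-based content features and the engagement rates above can be sketched as follows. The sentiment feature uses TextBlob in the actual pipeline and is only noted in a comment; the README does not define the two rates, so dividing by account age in days is a labeled assumption:

```python
import re

def content_features(text):
    """Count-based content features from the list above.
    Sentiment (Positive=1 / Negative=0) is computed with TextBlob in the
    actual pipeline and is omitted from this sketch."""
    return {
        "question_marks": text.count("?"),
        "urls": len(re.findall(r"https?://\S+", text)),
        "mentions": len(re.findall(r"@\w+", text)),
    }

def engagement_features(friends_count, favourites_count, account_age_days):
    """Sketch of the engagement rates. The exact definitions are not given
    in the README; normalizing by account age in days is an assumption."""
    age = max(account_age_days, 1)  # guard against zero-day-old accounts
    return {
        "following_rate": friends_count / age,
        "favourite_rate": favourites_count / age,
    }

print(content_features("Is this real? @bob https://t.co/x"))
# -> {'question_marks': 1, 'urls': 1, 'mentions': 1}
```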
cd ./task1/main_models
python train.py -m textbertweet -l 128 # Note: 128 is the longest for this model
cd ./task1/main_models
python train.py -m textbertweet_tf -l 128 # Note: 128 is the longest for this model
cd ./task1/main_models
python train.py -m textbert -l 256 # Note: 512 is the longest for this model
cd ./task1/main_models
python train.py -m textbert_tf -l 256 # Note: 512 is the longest for this model
Specify the model name (i.e., textbertweet / textbertweet_tf / textbert / textbert_tf), the max sequence length for BERT, the test data type (i.e., test or covid), and the path to an existing model. For example, to predict rumour or non-rumour on the test data with the existing BERT-with-statistics model bertcls_16.dat:
cd ./task1/main_models
python test.py -m textbert_tf -l 256 -t test -e bertcls_16.dat