This is the code for "How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task".
To run the training and evaluation for this paper, please set up the environment:
# Create environment.
conda create -n robustness-albert python=3.7
conda activate robustness-albert
# Install packages.
python setup.py develop
pip install -r requirements.txt
First, create a config file (see configs/example_config.json for an example).
Then, run the following:
python robustness_albert/train.py -c configs/CONFIG_FILE_NAME.json
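As an illustration of how a `-c` option like the one above is typically handled, here is a minimal sketch that parses the flag and loads a JSON config. It is not taken from `train.py`: the function names `parse_args` and `load_config`, the long option `--config`, and the requirement that the file holds a JSON object are all assumptions made for this example. The actual keys expected by the training script are those shown in configs/example_config.json.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse a -c/--config option (the long form is an assumption)."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point.")
    parser.add_argument(
        "-c",
        "--config",
        type=Path,
        required=True,
        help="Path to a JSON config file.",
    )
    return parser.parse_args(argv)


def load_config(path):
    """Read a JSON config file and return its contents as a dict."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    # Assumption: the config is a JSON object at the top level.
    if not isinstance(config, dict):
        raise ValueError(f"Expected a JSON object in {path}")
    return config
```

A script built this way would call `parse_args()` once at start-up and pass `args.config` to `load_config` before constructing the model and data loaders.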
For linting and unit testing, run the following commands:
# Linting.
flake8 robustness_albert
# Unit testing.
pytest -s tests/
test_models_checklist.ipynb: This notebook carries out the CheckList tests on all the random seeds.
dev_set_results.ipynb: This notebook loads the results of the different random seeds on the development set of SST-2 and calculates the Fleiss' Kappa agreement between the models.
extract_names_sst2.ipynb: This notebook extracts and saves the names that occur in the movie reviews of the train and test sets, so that they can be used for the designed CheckList capabilities. The resulting names can be found in assets/names_sst2_train.json and assets/names_sst2_test.json.
plot_checklist_results.ipynb: This notebook plots the results of the CheckList tests for all random seeds. It plots the error rates and overlap ratios and calculates the Fleiss' Kappa agreement.
extract_test_labels_sst2.ipynb: As the SST-2 dataset in HuggingFace does not come with the test labels, this notebook extracts them from the original SST-2 data from GLUE. The labels can be found in assets/sst2_test_labels.json.
create_checklist_tests_sst2.ipynb: This notebook creates the CheckList test suite that we use for the results. The resulting test suite can be found in assets/testset_19_07_21.pkl.
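Two of the notebooks above report Fleiss' Kappa as the agreement between models trained with different random seeds. The sketch below shows how that statistic can be computed from per-model predictions, treating each model as a rater and each example as an item. It is a plain-Python illustration of the standard formula, not the code used in the notebooks; the function names `counts_from_predictions` and `fleiss_kappa` and the input layout are assumptions made for this example.

```python
import math
from collections import Counter


def counts_from_predictions(predictions, categories):
    """Turn per-model label lists into an items x categories count table.

    predictions: one list of predicted labels per model (hypothetical layout),
    all of the same length.
    """
    n_items = len(predictions[0])
    if any(len(model_preds) != n_items for model_preds in predictions):
        raise ValueError("All models must predict the same number of examples.")
    table = []
    for i in range(n_items):
        votes = Counter(model_preds[i] for model_preds in predictions)
        table.append([votes[c] for c in categories])
    return table


def fleiss_kappa(table):
    """Compute Fleiss' Kappa from an items x categories count table."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2:
        raise ValueError("At least two raters are required.")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("Every item must be rated by the same number of raters.")
    n_categories = len(table[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance

    # Kappa is undefined when every rating falls into a single category.
    if math.isclose(p_e, 1.0):
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Example: three models that agree on every example.
table = counts_from_predictions([[0, 1, 0, 1]] * 3, categories=[0, 1])
kappa = fleiss_kappa(table)  # 1.0 for perfect agreement
```

Values near 1 indicate that the seeds mostly make the same predictions, while values near 0 indicate agreement no better than chance.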