Python 3.6+
git
pip3 install requests sh click
pip3 install regex docopt numpy sklearn scipy (only needed if you want to use classify_xvsy_logreg.py)
git clone git@github.com:sarnthil/unify-emotion-datasets.git
This will create a new folder called unify-emotion-datasets.
First run the script that downloads all obtainable datasets:
cd unify-emotion-datasets # go inside the repository
python3 download_datasets.py
Please read the instructions carefully: you will be asked to read the license and terms of use of each dataset and to confirm that you have done so. If a dataset cannot be downloaded directly, you will be given instructions on how to obtain it.
Then run the script that unifies the downloaded datasets, which will be located in unify-emotion-datasets/datasets/:
python3 create_unified_dataset.py
This will create a new file called unified-dataset.jsonl in the same folder.
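To check that the unification step produced a usable file, you can peek at the first few records. The following is a minimal sketch, assuming the file is in JSON Lines format (one JSON object per line, as the .jsonl extension suggests) and sits in the current directory; the helper name head_jsonl is hypothetical and not part of this repository.

```python
import json
from itertools import islice
from pathlib import Path


def head_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as dicts."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines; every other line is expected to be one JSON object.
        records = (json.loads(line) for line in handle if line.strip())
        return list(islice(records, n))


# Assumed location: the file written by create_unified_dataset.py.
unified = Path("unified-dataset.jsonl")
if unified.exists():
    for record in head_jsonl(unified):
        # Print only the field names to get a feel for the record layout.
        print(sorted(record.keys()))
```

If a line is not valid JSON, json.loads raises an error, which makes this a quick sanity check as well.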
We also advise you to cite the papers corresponding to the datasets you use. You can find the corresponding BibTeX citations in the file datasets/README.md or while running download_datasets.py.
An Analysis of Annotated Corpora for Emotion Classification in Text
If you plan to use this corpus, please use this citation:
@inproceedings{Bostan2018,
author = {Bostan, Laura Ana Maria and Klinger, Roman},
title = {An Analysis of Annotated Corpora for Emotion Classification in Text},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
year = {2018},
publisher = {Association for Computational Linguistics},
pages = {2104--2119},
location = {Santa Fe, New Mexico, USA},
url = {http://aclweb.org/anthology/C18-1179},
pdf = {http://aclweb.org/anthology/C18-1179.pdf}
}
If you want to reuse the code for the emotion classification task, see the script classify_xvsy_logreg.py:
python3 classify_xvsy_logreg.py --help
will show you the following:
Classify using MaxEnt algorithm
Usage:
classify_xvsy_logreg.py [options] <first> <second>
classify_xvsy_logreg.py [options] --all-vs <second>
Options:
-j --json=<JSONFILE> Filename of the json file [default: ../unified.jsonl]
-a --all-vs=<dataset> Dataset name of the testing data
-d --debug Use a small word list and a fast classifier
-o --output=<OUTPUT> Output folder [default: .]
-m --force-multi Force using multi-label classification
-k --keep-last Quit immediately if results file found
For example, if you want to train on TEC and test on EmoInt, do the following:
python3 classify_xvsy_logreg.py -d tec emoint
The names of the datasets are the ones used in the file unified-dataset.jsonl in the field source.
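To see which dataset names are available as arguments, you can list the distinct values of the source field. This is a rough sketch, assuming every record carries a source field and that unified-dataset.jsonl is in the current directory; count_sources is a hypothetical helper, not a function from this repository.

```python
import json
from collections import Counter
from pathlib import Path


def count_sources(path):
    """Count how many instances each dataset contributes, keyed by `source`."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # ignore blank lines
                counts[json.loads(line)["source"]] += 1
    return counts


# Assumed location: the file written by create_unified_dataset.py.
unified = Path("unified-dataset.jsonl")
if unified.exists():
    for name, total in count_sources(unified).most_common():
        print(f"{name}\t{total}")
```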
Use jq for easy interaction with unified-dataset.jsonl. Examples of how to use it for various tasks:
- select the instances whose source is crowdflower or tec:
jq 'select(.source=="crowdflower" or .source =="tec")' <unified-dataset.jsonl | less
- count how often instances are annotated with high surprise (a score above 0.5) per dataset:
jq 'select(.emotions.surprise >0.5) | .source' <unified-dataset.jsonl | sort | uniq -c
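If you prefer to stay in Python, the two jq queries can be approximated as follows. This is a sketch under stated assumptions: records have a source field and an emotions object with per-emotion scores (as the jq filters imply), and a missing or null score is treated as not matching. The helper names read_jsonl, select_sources and high_emotion_counts are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def select_sources(records, sources):
    """Keep records whose `source` is one of the given dataset names."""
    wanted = set(sources)
    return [r for r in records if r.get("source") in wanted]


def high_emotion_counts(records, emotion="surprise", threshold=0.5):
    """Count, per source, the records whose score for `emotion` exceeds `threshold`."""
    counts = Counter()
    for r in records:
        value = (r.get("emotions") or {}).get(emotion)
        # Assumption: only numeric scores count; null or missing scores are skipped.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value > threshold:
            counts[r["source"]] += 1
    return counts


# Assumed location: the file written by create_unified_dataset.py.
unified = Path("unified-dataset.jsonl")
if unified.exists():
    subset = select_sources(read_jsonl(unified), ["crowdflower", "tec"])
    print(len(subset), "instances from crowdflower or tec")
    for name, total in sorted(high_emotion_counts(read_jsonl(unified)).items()):
        print(f"{total:7d} {name}")
```

The file is read twice in the usage part so that it never has to be held in memory as a whole for the counting step.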