ACT: Averaging Classifiers for Text
This is a preliminary repo for classifiers that are restricted to making predictions as a weighted average of the training data, specifically for text datasets.
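As a rough illustration of the idea (not the implementation in this repo), such a classifier spreads softmax-style weights over the training examples and averages their labels. Everything in the following numpy sketch, including the dot-product similarity, is a simplifying assumption:

```python
import numpy as np

def averaging_predict(x, train_X, train_Y, temperature=1.0):
    """Illustrative only: predict for one encoded example x as a
    weighted average of the one-hot training labels, with softmax
    weights over similarity to each training encoding."""
    scores = train_X @ x / temperature   # similarity of x to each training example
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax weights over the training data
    return weights @ train_Y             # predicted class probabilities

# Toy usage: three training examples with 4-dim encodings, two classes.
train_X = np.random.randn(3, 4)
train_Y = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
print(averaging_predict(np.random.randn(4), train_X, train_Y))
```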
Requirements:
- python3
- pytorch 0.4.0
- torchvision
- numpy
- spacy
- scikit-learn
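Assuming a standard pip-based environment, the dependencies can be installed with something like the following (torch is the pip package name for pytorch; the exact torchvision pin to pair with torch 0.4.0 may need adjustment, and older spacy versions need an English model, downloaded with the second command):

pip install torch==0.4.0 torchvision numpy spacy scikit-learn

python -m spacy download en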
Setup:
In addition to setting up a python environment with the packages listed above, these models assume access to GloVe embeddings, which can be downloaded from https://nlp.stanford.edu/projects/glove/

By default, the models will look for the embeddings in data/glove/, but a different location can be specified at run time.
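GloVe releases are plain text files in which each line holds a word followed by its space-separated vector components. A minimal loading sketch (the specific filename below is an assumption and depends on which release you download):

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a plain-text release file, where each
    line is a word followed by its space-separated vector components."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove('data/glove/glove.6B.300d.txt')  # filename is an assumption
```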
Basic Usage:
To train a model using one of the pre-specified datasets, such as StackOverflow, use:
python run.py --dataset stackoverflow
This will download the dataset to data/stackoverflow/raw, preprocess it, train a baseline CNN model, predict on the test data, and save the output to data/temp/.
The output directory will contain files for the train, dev, and test data, each of which is a .npz file containing labels, predictions, and predicted probabilities.
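These outputs can be inspected with numpy. The array names used below are hypothetical, so check data.files for the names actually stored:

```python
import numpy as np

data = np.load('data/temp/test.npz')
print(data.files)  # lists the arrays actually stored in the file

# Hypothetical array names, for illustration; substitute the names printed above.
labels = data['labels']
probs = data['pred_probs']
print('accuracy:', (probs.argmax(axis=1) == labels).mean())
```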
To train a weighted averaging model, add --model act.
Custom datasets:
To train a model on a dataset that has not been prespecified, create a directory called data/[name]/raw/, where [name] is the name of your dataset. In that directory, create files called train.jsonlist and test.jsonlist. Each of those files should contain one document per line. Each line should be a JSON object with at least two fields: "text" and "label".
For example, the first line of a file could be the following JSON object:
{"text": "This is a positive document", "label": "positive"}
To train a model on this data, use:
python run.py --dataset [name]
again replacing [name] with the name of your dataset as above.
This will load the data, tokenize the text, and then proceed as above.
Options:
To choose the size of the output layer for the averaging classifier, use --z-dim [dz], where [dz] will be the dimensionality.
To train on a GPU, include the option --cuda.
To choose a different output directory, use --output-dir [output-dir], where [output-dir] is the desired target directory.
For additional options, such as model size and optimization choices, run:
python run.py -h
Evaluation:
The eval directory contains a number of scripts to help with evaluation. For example, to evaluate the calibration (and accuracy) of the predictions on test data in the data/temp/ directory, use:
python -m eval.eval_calibration data/temp/test.npz
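Calibration here refers to how well the predicted probabilities match empirical accuracy. As an illustration of the kind of quantity such a script reports (not its exact implementation), a binned expected calibration error can be computed as:

```python
import numpy as np

def expected_calibration_error(labels, probs, n_bins=10):
    """Bin predictions by confidence and compare average confidence to
    accuracy within each bin (a standard ECE estimate)."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    width = 1.0 / n_bins
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        mask = (confidences > lo) & (confidences <= lo + width)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

data = np.load('data/temp/test.npz')
# Array names are hypothetical, as noted above.
print(expected_calibration_error(data['labels'], data['pred_probs']))
```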
To inspect the calibration and confidence values, and correctness at a given epsilon value, say 0.1, use:
python -m eval.eval_conformal data/temp --eps 0.1
To evaluate these using the sum of weights rather than the probabilities, add --weights.
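In the conformal prediction setting, eps acts as an error tolerance: prediction sets are meant to contain the true label with probability at least 1 - eps. As a rough illustration of checking empirical coverage for threshold-based prediction sets (not the repo's procedure), one might compute:

```python
import numpy as np

def empirical_coverage(labels, probs, threshold):
    """Fraction of examples whose prediction set (all labels scoring at
    least `threshold`) contains the true label; for a tolerance eps,
    one hopes for coverage of at least 1 - eps."""
    prediction_sets = probs >= threshold
    return prediction_sets[np.arange(len(labels)), labels].mean()
```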