This repo contains consensus-based unsupervised and semi-supervised methods for estimating the quality of complex annotations based on user-specified distance functions.
Unlike related methods (e.g., Dawid-Skene or ZenCrowd), the methods here apply to complex annotations: those that cannot be expressed as binary or categorical (multiple-choice) responses. Examples include:
- Sequences - example response: [(4,18), (21,24), (102,112)]
- Free text - example response: "I made a repo today"
- Tree structure - example response: (S (NP I) (VP made (NP a repo) (NP today)))
- Image segments - example response: {"head":[(33,10), (89, 160)], "hand":[(20,210), (40,218)]}
The raw annotation dataset should contain columns for workerID, itemID, and annotation (named however you like). There are currently no command-line entry points, so you will need to write a Python script or Jupyter notebook. This is because the methods depend on a user-specified distance function, which must be a Python function of the following form:
def distance_fn(annotation1, annotation2):
    # compute a non-negative scalar distance between the two annotations
    # (0 for identical annotations, larger values for more dissimilar ones)
    return scalar_result
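For instance, for the span-sequence annotations listed above, one possible distance is one minus the Jaccard overlap of the positions covered by the spans. The function below is only an illustrative sketch of such a user-supplied distance, not part of this repo:

def span_distance(annotation1, annotation2):
    # positions covered by a list of (start, end) spans, end-exclusive
    def covered(spans):
        return {i for start, end in spans for i in range(start, end)}

    c1, c2 = covered(annotation1), covered(annotation2)
    if not c1 and not c2:
        return 0.0  # two empty annotations are identical
    # 1 - Jaccard overlap: 0.0 for identical span sets, 1.0 for disjoint ones
    return 1.0 - len(c1 & c2) / len(c1 | c2)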
Here is an example of how you might run this on a translations dataset:
import pandas as pd
from nltk.translate.gleu_score import sentence_gleu

import experiments

annotation_df = pd.read_csv("translations.csv")

# distance = 1 - symmetric GLEU similarity between the two translations
distance_fn = lambda x, y: 1 - (sentence_gleu([x.split(" ")], y.split(" ")) + sentence_gleu([y.split(" ")], x.split(" "))) / 2

translation_experiment = experiments.RealExperiment(eval_fn=None,
                                                    label_colname="translation",
                                                    item_colname="sentence",
                                                    uid_colname="worker",
                                                    distance_fn=distance_fn)
translation_experiment.setup(annotation_df)
translation_experiment.train()
The results are stored in the dictionaries translation_experiment.bau_preds, translation_experiment.sad_preds, and translation_experiment.mas_preds, which contain the estimated best annotation per item according to each method:
- BAU: chooses the annotation made by the worker who on average agrees most with consensus over the whole dataset.
- SAD: for each item, chooses the annotation with the smallest average distance to the other annotations for that item (analogous to a majority vote).
- MAS: chooses the annotation estimated to be best by a probabilistic model that considers both within-item consensus and worker average consensus.
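Assuming each of these dictionaries maps an item ID to its selected annotation (an assumption about the exact keying), you could gather the estimates into a single dataframe for inspection, e.g.:

# sketch: collect the per-item estimates from each method into one dataframe
# (assumes the *_preds dicts map item ID -> selected annotation)
preds_df = pd.DataFrame({
    "bau": pd.Series(translation_experiment.bau_preds),
    "sad": pd.Series(translation_experiment.sad_preds),
    "mas": pd.Series(translation_experiment.mas_preds),
})
preds_df.to_csv("estimated_best_translations.csv")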