cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.

Aggregated vs raw labels

ejhumphrey opened this issue

I was just working through some dataset integrity checks and noticed that the confidence / relevance score reported by CrowdFlower isn't only a function of num_responses.
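
For reference, a minimal sketch of the kind of check being described, assuming the aggregated labels sit in a CSV with columns like [sample_key, instrument, relevance, num_responses] (the file name and column names here are illustrative, not necessarily the release layout):

```python
import pandas as pd

# Assumed file/column names for illustration only.
labels = pd.read_csv("openmic-2018-aggregated-labels.csv")

# If relevance were purely a function of num_responses, each group below
# would contain exactly one distinct relevance value.
spread = labels.groupby("num_responses")["relevance"].nunique()
print(spread)
# Multiple distinct relevance values within a num_responses group suggests
# something else (e.g. annotator trust weighting) enters the score.
```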


Looking at the raw data, it could be a mix of trusted / untrusted judgments, e.g. when an annotator dropped below a confidence level and was removed from the job, or if confidence is weighted by trust (whether this is intra- or inter-task, I'm not sure).

This raises at least two questions:

  • Are we happy to proceed with the confidence / relevance scores provided by CF?
  • Is there value in also sharing a more granular dataset of annotator responses? This table's columns could look like [sample_key, annotator_id, trust, instrument, response] (see the sketch after this list).
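
To make the proposal concrete, here's a tiny sketch of what that granular table might look like; the values are made up and the schema is just the columns proposed above, not an actual CF export format:

```python
import pandas as pd

# Hypothetical raw-response records; one row per (sample_key, instrument,
# annotator) judgment, which is enough to recompute any trust-weighted
# aggregate downstream.
responses = pd.DataFrame(
    [
        {"sample_key": "000046_3840", "annotator_id": "a01", "trust": 0.92,
         "instrument": "guitar", "response": "present"},
        {"sample_key": "000046_3840", "annotator_id": "a07", "trust": 0.71,
         "instrument": "guitar", "response": "absent"},
    ]
)
print(responses)
```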

If it helps, back in 2010-11 someone at CF told me "We measure individual worker quality quite simply by their accuracy on gold units. When aggregating, we simply weight each worker by their accuracy and take the answer with the highest weighted majority vote. What works well is that we do not let people continue on our tasks if their gold accuracy is below 70% (or another specified threshold)"
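
For concreteness, that aggregation rule could be sketched roughly like this (an illustrative reconstruction of the description in the quote, not CrowdFlower's actual code):

```python
from collections import defaultdict

def weighted_majority(judgments, min_accuracy=0.7):
    """Aggregate one unit's judgments as the quote describes: drop workers
    below the gold-accuracy threshold, weight the rest by their accuracy,
    and return the answer with the largest total weight.

    `judgments` is a list of (answer, gold_accuracy) pairs.
    """
    totals = defaultdict(float)
    for answer, accuracy in judgments:
        if accuracy < min_accuracy:
            continue  # workers below the gold threshold are excluded
        totals[answer] += accuracy
    return max(totals, key=totals.get) if totals else None

# e.g. two strong workers say "present"; the one below threshold is dropped
print(weighted_majority([("present", 0.9), ("present", 0.8),
                         ("absent", 0.72), ("absent", 0.6)]))
```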

As for aggregating or not, regardless of what you do, please share raw annotations.

hm, okay, thanks! I wonder if that algorithm has shifted any in the last several years.

re: "raw" annotations, @julian-urbano do you have thoughts on the proposed columns?

Last time I used CF (maybe 2013?) they provided separate files for workers' info and for their answers to the units, like this. I'd find it very useful to have all that if possible, but if not, what you proposed looks fine.

cool, the only other thing potentially worth folding in is channel, perhaps in some kind of anonymized form? otherwise, let's start with what I've proposed and we can revisit it later if need be.

@bmcfee do you have any opinions before I do this?

@ejhumphrey not specifically; all of the above sounds good to me. As you say, it would be good to get some explicit confirmation from CF about where that rating comes from/whether the process has changed since 2013.

To whatever extent we can anonymize / protect the annotators' information, we should do so. I suspect that's not too big of a deal here, but it's worth considering before moving forward on releasing granular annotation data.

agreed, I was (implicitly) planning a non-deterministic one-way mapping for worker IDs; it could be worth doing the same for channel. The remaining info named above isn't personal.
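
One possible realization of such a mapping, sketched for illustration (a randomly salted hash where the salt is discarded after the export; not necessarily what will actually be used for the release):

```python
import hashlib
import secrets

# Salt generated once per export and then thrown away, so the mapping is
# consistent within the release but can't be reproduced or reversed later.
salt = secrets.token_bytes(16)

def anonymize(worker_id: str) -> str:
    return hashlib.sha256(salt + worker_id.encode("utf-8")).hexdigest()[:12]

print(anonymize("worker_12345"))   # same pseudonym for the same worker...
print(anonymize("worker_67890"))   # ...but unlinkable to the original ID
```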

this will be addressed by #23

fixed as of #23