cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.

Aggregated vs raw labels

ejhumphrey opened this issue

I was just working through some dataset integrity checks and noticed that the confidence / relevance score reported by CrowdFlower isn't only a function of num_responses.
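
For reference, a minimal sketch of the kind of check being described, assuming the aggregated labels sit in a CSV with columns like [sample_key, instrument, relevance, num_responses] (the file name and column names here are illustrative, not necessarily the release layout):

```python
import pandas as pd

# Assumed file/column names for illustration only.
labels = pd.read_csv("openmic-2018-aggregated-labels.csv")

# If relevance were purely a function of num_responses, each group below
# would contain exactly one distinct relevance value.
spread = labels.groupby("num_responses")["relevance"].nunique()
print(spread)
# Multiple distinct relevance values within a num_responses group suggests
# something else (e.g. annotator trust weighting) enters the score.
```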


Looking at the raw data, it could be a mix of trusted / untrusted judgments, e.g. when an annotator dropped below a confidence level and was removed from the job, or if confidence is weighted by trust (whether this is intra- or inter-task, I'm not sure).

This raises at least two questions:

  • Are we happy to proceed with the confidence / relevance scores provided by CF?
  • Is there value in also sharing a more granular dataset of annotator responses? This table's columns could look like [sample_key, annotator_id, trust, instrument, response] (see the sketch after this list).
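
To make the proposal concrete, here's a tiny sketch of what that granular table might look like; the values are made up and the schema is just the columns proposed above, not an actual CF export format:

```python
import pandas as pd

# Hypothetical raw-response records; one row per (sample_key, instrument,
# annotator) judgment, which is enough to recompute any trust-weighted
# aggregate downstream.
responses = pd.DataFrame(
    [
        {"sample_key": "000046_3840", "annotator_id": "a01", "trust": 0.92,
         "instrument": "guitar", "response": "present"},
        {"sample_key": "000046_3840", "annotator_id": "a07", "trust": 0.71,
         "instrument": "guitar", "response": "absent"},
    ]
)
print(responses)
```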

If it helps, back in 2010-11 someone at CF told me "We measure individual worker quality quite simply by their accuracy on gold units. When aggregating, we simply weight each worker by their accuracy and take the answer with the highest weighted majority vote. What works well is that we do not let people continue on our tasks if their gold accuracy is below 70% (or another specified threshold)"
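
For concreteness, that aggregation rule could be sketched roughly like this (an illustrative reconstruction of the description in the quote, not CrowdFlower's actual code):

```python
from collections import defaultdict

def weighted_majority(judgments, min_accuracy=0.7):
    """Aggregate one unit's judgments as the quote describes: drop workers
    below the gold-accuracy threshold, weight the rest by their accuracy,
    and return the answer with the largest total weight.

    `judgments` is a list of (answer, gold_accuracy) pairs.
    """
    totals = defaultdict(float)
    for answer, accuracy in judgments:
        if accuracy < min_accuracy:
            continue  # workers below the gold threshold are excluded
        totals[answer] += accuracy
    return max(totals, key=totals.get) if totals else None

# e.g. two strong workers say "present"; the one below threshold is dropped
print(weighted_majority([("present", 0.9), ("present", 0.8),
                         ("absent", 0.72), ("absent", 0.6)]))
```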

As for aggregating or not, regardless of what you do, please share raw annotations.

hm, okay, thanks! I wonder if that algorithm has shifted any in the last several years.

re: "raw" annotations, @julian-urbano do you have thoughts on the proposed columns?

Last time I used CF (maybe 2013?) they provided separate files for workers' info and for their answers to the units, like this. I'd find it very useful to have all that if possible, but if not, what you proposed looks fine.

cool, the only other thing potentially worth folding in is channel, perhaps in some kind of anonymized form? otherwise, let's start with what I've proposed and we can revisit it later if need be.

@bmcfee do you have any opinions before I do this?

@ejhumphrey not specifically; all of the above sounds good to me. As you say, it would be good to get some explicit confirmation from CF about where that rating comes from/whether the process has changed since 2013.

To whatever extent we can anonymize / protect the annotators' information, we should do so. I suspect that's not too big of a deal here, but it's worth considering before moving forward on releasing granular annotation data.

agreed, I was (implicitly) planning a non-deterministic one-way mapping for worker IDs; it could be worth doing the same for channel. The remaining info named above isn't personal.
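
One possible realization of such a mapping, sketched for illustration (a randomly salted hash where the salt is discarded after the export; not necessarily what will actually be used for the release):

```python
import hashlib
import secrets

# Salt generated once per export and then thrown away, so the mapping is
# consistent within the release but can't be reproduced or reversed later.
salt = secrets.token_bytes(16)

def anonymize(worker_id: str) -> str:
    return hashlib.sha256(salt + worker_id.encode("utf-8")).hexdigest()[:12]

print(anonymize("worker_12345"))   # same pseudonym for the same worker...
print(anonymize("worker_67890"))   # ...but unlinkable to the original ID
```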

this will be addressed by #23

fixed as of #23