GT-Vision-Lab / VQA

Strange evaluation

tudor-berariu opened this issue · comments

In vqaEval.py, lines 97-104, the code that computes the accuracy for a generated answer seems to produce strange values. For example, if a question has 8 "yes" answers and 2 "no" answers (provided by the workers), the accuracy of a generated answer would be 0.533 for "no" and 0.2 for "yes".

  • accuracy for "yes": 2/10 * min(1, 8/3) = 0.2
  • accuracy for "no": 8/10 * min(1, 2/3) ≈ 0.533

Can you please explain the reasons for that specific evaluation scheme?

Lines 97-104 are doing the following --

A given generated answer is evaluated against all 10 choose 9 subsets of ground truth answers (for loop in line 97). In each such iteration, the evaluation uses the metric --
min(1, (number of matching answers among the 9 ground truth answers)/3) [line 100].

So if a question has 8 "yes" answers and 2 "no" answers (provided by the workers), the accuracy of a generated answer would be 0.6 for "no" and 1.0 for "yes". Below is the detailed calculation --

accuracy for "no" -- 1/10 * ( 8 * min(1, 2/3) + 2 * min(1, 1/3) ) = 0.6
accuracy for "yes" -- 1/10 * ( 8 * min(1, 7/3) + 2 * min(1, 8/3) ) = 1

BTW, the accuracies that you mentioned -- 0.533 for "no" and 0.2 for "yes" -- are these the results from running the code (lines 97-104 in vqaEval.py)?

As mentioned in the paper, we average over all subsets of 9 answers (out of the 10) so that the human evaluation scores (computed without collecting an eleventh answer) are consistent with the results people will report on automatically generated answers.

Thank you very much for your answer.
I misunderstood line 98, specifically this piece of code: if item!=gtAnsDatum. I thought it was meant to remove all answers identical to the current one.

Thank you for your time

Yes, it's a subtle one due to gtAnsDatum being an object and not the actual string.
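
To illustrate the subtlety, here is a tiny sketch using simplified answer dicts (hypothetical minimal data, not the real annotation loading): distinct ground truth entries differ in answer_id even when the answer string is the same, so the comparison drops only the current entry.

```python
gts = [
    {"answer": "yes", "answer_id": 1},
    {"answer": "yes", "answer_id": 2},
    {"answer": "no",  "answer_id": 3},
]

# gtAnsDatum is an answer dict, not the answer string.
gtAnsDatum = gts[0]
otherGTAns = [item for item in gts if item != gtAnsDatum]

print(otherGTAns)
# [{'answer': 'yes', 'answer_id': 2}, {'answer': 'no', 'answer_id': 3}]
# The second "yes" entry is kept; only the current entry is excluded.
```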