apple / ml-qrecc

Open-Domain Question Answering Goes Conversational via Question Rewriting

Questions about your retrieval evaluation

ellenmellon opened this issue

Hi, thanks again for your great work! I was looking at your data and the retrieval evaluation script. I noticed that there are many examples with empty gold answers (as you pointed out in the paper, roughly 9%), and I wonder whether you calculate scores for those examples as well. If so, wouldn't their metric values always be 0, since no passage span can score above 0.8 F1 against an empty answer string? It would be great if you could confirm this. In addition, do you tokenize the answer and passage strings before calculating F1, or leave them as is?
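For context, here is roughly what I mean by token-level F1 (just a sketch on my end, assuming lowercasing and whitespace tokenization; I'm not sure this matches what your script does):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        # also covers the empty-answer case: an empty reference can never overlap
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```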

Another related question: I thought you augmented each gold passage by expanding it with other passages that contain a string span with > 0.8 F1, but your evaluation script seems to count only passages containing such a span as gold. In other words, if the actual gold passage does not contain a span meeting the 0.8 threshold, it would not be counted even though it was labeled. Is that correct, or did I miss something?

I'm sorry for dumping so many questions here, but I'd really appreciate it if you could help me clarify them. Thanks a lot in advance!

Hi Ellen, that's great to hear from you! You are right, we did not exclude empty answers in our original evaluation. We have corrected this and recalculated the scores. You can find the updated evaluation script used in the SCAI-QReCC shared task here: https://github.com/scai-conf/SCAI-QReCC-21/blob/main/code/evaluation-script/scai-qrecc21-evaluator.py, and the updated scores for the baseline we introduced in the paper on the leaderboard: https://www.tira.io/task/scai-qrecc/dataset/scai-qrecc21-test-dataset-2021-05-15
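In spirit, the correction amounts to dropping turns with empty gold answers before averaging the per-turn metric. A minimal sketch (not the actual shared-task evaluator; the "Answer" field name and the `per_turn_metric` callback are just illustrative):

```python
def average_score(turns, per_turn_metric):
    """Average a per-turn retrieval metric over turns with a non-empty gold answer.

    `turns` is assumed to be a list of dicts with an "Answer" string field;
    `per_turn_metric` maps one turn to a float score.
    """
    scored = [per_turn_metric(t) for t in turns if t.get("Answer", "").strip()]
    return sum(scored) / len(scored) if scored else 0.0
```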

if the actual gold passage does not have a span meeting the 0.8 threshold, it would not be counted even if it has been labeled

That's a very good point! The reason is that our ground truth annotations do not contain any gold passages, only the links to the pages that were used to produce the answers. You can consider the passage retrieval task a form of weak supervision for QA, with the F1 score as a heuristic for passage relevance.
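The heuristic is roughly the following (a self-contained sketch assuming whitespace tokenization and window sizes around the answer length; the actual evaluator may differ in details):

```python
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two strings (lowercased, whitespace-tokenized)."""
    a_tok, b_tok = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_tok) & Counter(b_tok)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(a_tok), overlap / len(b_tok)
    return 2 * p * r / (p + r)

def passage_is_relevant(passage: str, answer: str, threshold: float = 0.8) -> bool:
    """Weak relevance label: True if some token span of the passage scores
    F1 > threshold against the gold answer string."""
    p_tok = passage.lower().split()
    a_tok = answer.lower().split()
    if not a_tok:
        # turns with an empty gold answer can never be matched by this heuristic
        return False
    best = 0.0
    # slide windows of roughly the answer's length over the passage
    for width in range(max(1, len(a_tok) - 2), len(a_tok) + 3):
        for start in range(0, max(1, len(p_tok) - width + 1)):
            span = " ".join(p_tok[start:start + width])
            best = max(best, token_f1(span, answer))
    return best > threshold
```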

Thank you so much for your answers and quick response, Svitlana! I'll look into the SCAI-QReCC shared task page instead :)