Is factCC reliable for factual correctness evaluation?
nightdessert opened this issue
I really appreciate the excellent paper.
I tested FactCC on the CNN/DM dataset using the gold reference summaries as claims (split into single sentences).
I strictly followed the README and used the official pre-trained FactCC checkpoint.
I labeled all the claims as 'CORRECT' (because they are gold references).
The accuracy reported by FactCC is around 42%, which means the model considers only 42% of the reference sentences factually correct.
Is this reasonable, or am I using the metric incorrectly?
Since the metric is based on uncased BERT, I did use lower-cased inputs.
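For concreteness, the setup described above amounts to building a JSONL file in which every gold-summary sentence becomes one claim labeled CORRECT. Below is a minimal sketch; the field names ("id", "text", "claim", "label"), the label value, and the use of NLTK for sentence splitting are my assumptions, so compare against the data files shipped with the FactCC repo before using it.

```python
# Minimal sketch (assumed format): split each gold summary into sentences and
# pair every sentence, as a claim, with its source article, labeled CORRECT.
import json
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt") first

def build_eval_jsonl(articles, gold_summaries, out_path):
    """articles and gold_summaries are parallel lists of strings."""
    with open(out_path, "w") as f:
        for idx, (article, summary) in enumerate(zip(articles, gold_summaries)):
            for claim in sent_tokenize(summary):
                example = {
                    "id": str(idx),
                    "text": article.lower(),   # model is uncased, so lower-case inputs
                    "claim": claim.lower(),
                    "label": "CORRECT",        # gold references are assumed correct
                }
                f.write(json.dumps(example) + "\n")
```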
I got the same result...
When I used the 'generated data' and 'annotated data', it worked well, but the gold CNN/DM data gives strange results.
I used the gold summary sentences as claims.
(One summary has several sentences; I split it and used each sentence as a separate claim.)
In fact, I also encountered this problem. Using the method mentioned above, I evaluated with the gold summaries and got the following result:
***** Eval results *****
bacc = 0.41546565056595314
f1 = 0.41546565056595314
loss = 3.5899247798612546
On the authors' annotated dataset, the results are as follows:
***** Eval results *****
bacc = 0.7611692646110668
f1 = 0.8614393125671321
loss = 0.8623681812816171
Some of my observations are below:
- The manual test set is strongly label-imbalanced: only 62 out of 503 examples are incorrect. A majority-voting baseline would give a balanced accuracy of 0.5 and an F1 of about 0.88, basically the same as the MNLI or FEVER baselines (a quick sanity check is sketched after this list).
- The FactCC model performs much worse on the incorrect class. The F1 scores for the correct and incorrect classes are 0.92 and 0.49.
- The CNN/DM models that generated the predictions are highly extractive (as validated by many papers), so the "correct" cases are mostly trivial: almost the exact sentence is contained in the source article.
- I'm not surprised that it performed poorly on the gold summaries because 1) the dataset contains noise: if you take individual sentences out of a summary, they may not make sense on their own (e.g., unresolved pronouns); and 2) as I mentioned in the third point above, the correct cases in the manual test set are often trivial, whereas the gold summaries contain a lot of paraphrasing and abstraction. There is also considerable "hallucination" (the summary contains information not mentioned in the article). Therefore, the model is very likely to predict "incorrect".
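To make the imbalance argument concrete, here is a quick sanity check of the numbers in the first point, assuming 441 correct and 62 incorrect examples and a majority-vote baseline that always predicts CORRECT (scikit-learn used purely for illustration):

```python
# Sanity check of the majority-voting baseline on a 441/62 correct/incorrect split.
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1] * 441 + [0] * 62   # 1 = CORRECT, 0 = INCORRECT (503 examples total)
y_pred = [1] * 503              # majority vote: always predict CORRECT

print(balanced_accuracy_score(y_true, y_pred))      # 0.5
print(f1_score(y_true, y_pred, average="micro"))    # ~0.88 (equals plain accuracy)
print(f1_score(y_true, y_pred, pos_label=1))        # ~0.93 for the CORRECT class alone
```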
In summary, I think FactCC can identify local errors like swapping entities or numbers. However, don't count on it to solve the hard NLI problem. Overall, it's still one of the better metrics. You can also check out the following paper.
Goyal, Tanya, and Greg Durrett. "Evaluating factuality in generation with dependency-level entailment." arXiv preprint arXiv:2010.05478 (2020).
I greatly appreciate the discussion above.
Has anyone retrained or fine-tuned the model to get results on CNN/DM or another dataset?
Would that help produce a more precise factuality evaluation? If not, is FactCC reliable enough to be reported as a metric in a paper?
I notice that some papers use FactCC as a metric.
If FactCC has these problems, then results that rely on it as a metric may not be reliable.
@Ricardokevins, you can take a look at the following two comprehensive surveys on factuality metrics. What's disturbing is that they reach very different conclusions. If you're writing a paper, the best you can do is pick 1-2 metrics from each category (e.g., entailment, QA, optionally IE) and report the results of all of them. You also need to do a small-scale human evaluation on something like 50-100 summaries.
Gabriel, Saadia, et al. "Go figure! a meta evaluation of factuality in summarization." arXiv preprint arXiv:2010.12834 (2020).
Pagnoni, Artidoro, Vidhisha Balachandran, and Yulia Tsvetkov. "Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics." arXiv preprint arXiv:2104.13346 (2021).
thanks a lot <3
My result on the annotated dataset is the same as yours. This result, however, is not consistent with the Table 3 F1 score for FactCC. Does anyone have an intuition for why?