Rouge score accuracy
pltrdy opened this issue · comments
The results are known to be quite different from those of the official ROUGE scoring script.
It has been discussed here:
google/seq2seq#89
It has been improved with #6
I compared the two scorings on multi-sentence files (10,397 lines, 508,630 words) and got:
- Official ROUGE (using files2rouge), took 111 seconds:
---------------------------------------------
1 ROUGE-1 Average_R: 0.34882 (95%-conf.int. 0.34632 - 0.35132)
1 ROUGE-1 Average_P: 0.40104 (95%-conf.int. 0.39803 - 0.40391)
1 ROUGE-1 Average_F: 0.36161 (95%-conf.int. 0.35934 - 0.36383)
---------------------------------------------
1 ROUGE-2 Average_R: 0.13938 (95%-conf.int. 0.13718 - 0.14151)
1 ROUGE-2 Average_P: 0.16228 (95%-conf.int. 0.15968 - 0.16490)
1 ROUGE-2 Average_F: 0.14511 (95%-conf.int. 0.14293 - 0.14729)
---------------------------------------------
1 ROUGE-L Average_R: 0.32234 (95%-conf.int. 0.31998 - 0.32478)
1 ROUGE-L Average_P: 0.37093 (95%-conf.int. 0.36804 - 0.37374)
1 ROUGE-L Average_F: 0.33429 (95%-conf.int. 0.33208 - 0.33647)
- this code, took 20 seconds:
{
"rouge-1": {
"f": 0.3672435871687543,
"p": 0.40349020487306564,
"r": 0.3527286721707171
},
"rouge-2": {
"f": 0.14396864450679678,
"p": 0.16098625779779233,
"r": 0.13821563233163145
},
"rouge-l": {
"f": 0.32548307280858685,
"p": 0.3741943564047806,
"r": 0.32687448001488595
}
}
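For a quick sense of the gap, the F-score differences between the two runs can be computed directly from the numbers above (values copied from this comment; stdlib only):

```python
# Average F-scores reported above: official ROUGE-1.5.5 vs. this package.
official = {"rouge-1": 0.36161, "rouge-2": 0.14511, "rouge-l": 0.33429}
package = {"rouge-1": 0.3672435871687543,
           "rouge-2": 0.14396864450679678,
           "rouge-l": 0.32548307280858685}

# Signed delta per metric (package minus official).
for metric in official:
    delta = package[metric] - official[metric]
    print(f"{metric}: delta F = {delta:+.5f}")
```

All three deltas are under 0.01 absolute, but that is still far larger than rounding noise.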
Maybe the difference is caused by line 92 in 8255cac: splitting by `'.'` will remove all `'.'` in hyp and ref.

@shijx12 It's not the only reason, but you've got a good point: that code does not make sense. I'm editing it and evaluating the impact. Thanks for pointing this out.
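As a minimal illustration of the problem shijx12 points at (a standalone sketch, not the package's actual code): splitting a string on `'.'` discards the periods themselves, so they never reach any downstream token or n-gram counts.

```python
text = "the cat sat . the dog ran ."

# Naive sentence split on '.': the separator is consumed, so the periods
# vanish from every resulting sentence.
sentences = [s.strip() for s in text.split('.') if s.strip()]
tokens_after_split = " ".join(sentences).split()

# Tokenizing the raw text keeps the periods as tokens.
tokens_direct = text.split()

print(tokens_after_split)                              # no '.' tokens left
print(len(tokens_direct) - len(tokens_after_split))    # 2 periods lost
```

Whether this matters depends on how the official script treats punctuation, which is exactly the open question in the rest of this thread.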
Hi @pltrdy ,
Could you run some evaluation to compare the differences between the perl script and yours? How much does it differ? I would love to get rid of the perl script! https://github.com/RxNLP/ROUGE-2.0 seems to have identical scores (besides a +1 smoothing they did not implement, because no indication of it was present in the official ROUGE script).
@Diego999 that's precisely what I did here: #2 (comment).
In addition, results may slightly differ because of how sentence endings are handled, as suggested in #2 (comment).
@pltrdy yes, but that was in February, and some modifications have been made since ;) Especially the remark in #2 (comment). Have you re-run the experiments since?
It should be similar, if not exactly the same. I'm not sure how punctuation is handled in the official script. I've attempted some fixes, which seem to make things worse. Punctuation may simply be ignored, in which case the naive implementation may be the right one.
Ok, thank you for your answer !
> seems to have identical scores

Is it documented somewhere that ROUGE-2.0 has identical scores?
@AlJohri Yes, see the last paragraph of their paper.
By the way, I solved this problem here: https://github.com/Diego999/py-rouge. Have a look at the README to understand the cases where the results differ (by about 4e-5).
that's great to hear @Diego999! are you planning on releasing this as an independent package or merging it back into pltrdy/rouge?