Diego999 / py-rouge

Full Python implementation of the ROUGE metric, producing the same results as the official Perl implementation.

avg & best the same

DevZiegler opened this issue

Hi, can it be that apply_avg and apply_best always produce the same scores, no matter which one is selected?

Evaluation with Avg
	rouge-1:	P: 75.71	R: 75.00	F1: 74.72
	rouge-2:	P: 66.67	R: 62.50	F1: 63.93
	rouge-3:	P: 58.33	R: 50.00	F1: 53.33
	rouge-4:	P: 50.00	R: 37.50	F1: 41.67
	rouge-l:	P: 78.85	R: 78.61	F1: 78.26
	rouge-w:	P: 74.65	R: 53.28	F1: 61.69

Evaluation with Best
	rouge-1:	P: 75.71	R: 75.00	F1: 74.72
	rouge-2:	P: 66.67	R: 62.50	F1: 63.93
	rouge-3:	P: 58.33	R: 50.00	F1: 53.33
	rouge-4:	P: 50.00	R: 37.50	F1: 41.67
	rouge-l:	P: 78.85	R: 78.61	F1: 78.26
	rouge-w:	P: 74.65	R: 53.28	F1: 61.69

Evaluation with Individual
	Hypothesis #0 & Reference #0: 
		rouge-1:	P: 80.00	R: 80.00	F1: 80.00
	Hypothesis #1 & Reference #0: 
		rouge-1:	P: 42.86	R: 60.00	F1: 50.00
	Hypothesis #2 & Reference #0: 
		rouge-1:	P: 100.00	R: 80.00	F1: 88.89
	Hypothesis #3 & Reference #0: 
		rouge-1:	P: 80.00	R: 80.00	F1: 80.00

	Hypothesis #0 & Reference #0: 
		rouge-2:	P: 75.00	R: 75.00	F1: 75.00
	Hypothesis #1 & Reference #0: 
		rouge-2:	P: 16.67	R: 25.00	F1: 20.00
	Hypothesis #2 & Reference #0: 
		rouge-2:	P: 100.00	R: 75.00	F1: 85.71
	Hypothesis #3 & Reference #0: 
		rouge-2:	P: 75.00	R: 75.00	F1: 75.00

	Hypothesis #0 & Reference #0: 
		rouge-3:	P: 66.67	R: 66.67	F1: 66.67
	Hypothesis #1 & Reference #0: 
		rouge-3:	P:  0.00	R:  0.00	F1:  0.00
	Hypothesis #2 & Reference #0: 
		rouge-3:	P: 100.00	R: 66.67	F1: 80.00
	Hypothesis #3 & Reference #0: 
		rouge-3:	P: 66.67	R: 66.67	F1: 66.67

	Hypothesis #0 & Reference #0: 
		rouge-4:	P: 50.00	R: 50.00	F1: 50.00
	Hypothesis #1 & Reference #0: 
		rouge-4:	P:  0.00	R:  0.00	F1:  0.00
	Hypothesis #2 & Reference #0: 
		rouge-4:	P: 100.00	R: 50.00	F1: 66.67
	Hypothesis #3 & Reference #0: 
		rouge-4:	P: 50.00	R: 50.00	F1: 50.00

	Hypothesis #0 & Reference #0: 
		rouge-l:	P: 83.03	R: 83.03	F1: 83.03
	Hypothesis #1 & Reference #0: 
		rouge-l:	P: 49.36	R: 65.33	F1: 56.23
	Hypothesis #2 & Reference #0: 
		rouge-l:	P: 100.00	R: 83.03	F1: 90.73
	Hypothesis #3 & Reference #0: 
		rouge-l:	P: 83.03	R: 83.03	F1: 83.03

	Hypothesis #0 & Reference #0: 
		rouge-w:	P: 80.00	R: 57.98	F1: 67.23
	Hypothesis #1 & Reference #0: 
		rouge-w:	P: 38.61	R: 39.18	F1: 38.89
	Hypothesis #2 & Reference #0: 
		rouge-w:	P: 100.00	R: 57.98	F1: 73.40
	Hypothesis #3 & Reference #0: 
		rouge-w:	P: 80.00	R: 57.98	F1: 67.23

Hi,

You only use one reference summary per hypothesis. That means the average score and the best score are both simply the score with respect to that single reference summary. You will only get different results when you supply several different reference summaries for a hypothesis.
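
To illustrate why the two modes coincide with a single reference, here is a minimal sketch. It does not use py-rouge: `unigram_f1` is a toy unigram-overlap score standing in for ROUGE-1, and `aggregate` is a hypothetical helper that only mimics the idea of averaging versus taking the best score over the references of one hypothesis.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy unigram-overlap F1 (a stand-in for ROUGE-1, not py-rouge's code)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def aggregate(hypothesis: str, references: list[str], mode: str) -> float:
    """Combine the per-reference scores of one hypothesis (hypothetical helper)."""
    scores = [unigram_f1(hypothesis, ref) for ref in references]
    if mode == "avg":
        return sum(scores) / len(scores)
    if mode == "best":
        return max(scores)
    raise ValueError(f"unknown mode: {mode!r}")


hypothesis = "the cat sat on the mat"
one_reference = ["the cat sat on the mat"]
two_references = ["the cat sat on the mat", "a dog ran"]

# One reference: there is only one score, so mean and max are identical.
print(aggregate(hypothesis, one_reference, "avg"),
      aggregate(hypothesis, one_reference, "best"))

# Two references: the mean and the max can now differ.
print(aggregate(hypothesis, two_references, "avg"),
      aggregate(hypothesis, two_references, "best"))
```

This is consistent with the output in the question: with one reference per hypothesis, both aggregate tables reduce to the mean of the individual scores over the four hypotheses, for example rouge-1 F1 = (80.00 + 50.00 + 88.89 + 80.00) / 4 = 74.72.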