cs-chan / Total-Text-Dataset

Total-Text Dataset. It consists of 1,555 images with three different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

Confused about the evaluation parameters

lillyPJ opened this issue · comments

Hi.
According to the standard DetEval evaluation protocol, "tr = 0.8, tp = 0.4" (which is also your default setting in the MATLAB code, Eval.m). But you recommend "tr = 0.7 and tp = 0.6" in your Evaluation_Protocol/README.md file:

> We recommend tr = 0.7 and tp = 0.6 threshold for a fairer evaluation with polygon ground-truth and detection format.

I am confused about how to set tr and tp when I want to compare my results with other methods (listed in the Table Ranking).

Detection (based on DetEval evaluation protocol, unless stated)

| Method | Precision (%) | Recall (%) | F-measure (%) | Published at |
|---|---|---|---|---|
| MSR [paper] | 85.2 | 73.0 | 78.6 | arXiv:1901.02596 |
| FTSN [paper] | 84.7 | 78.0 | 81.3 | ICPR 2018 |
| TextSnake [paper] | 82.7 | 74.5 | 78.4 | ECCV 2018 |
| TextField [paper] | 81.2 | 79.9 | 80.6 | TIP 2019 |
| CTD [paper] | 74.0 | 71.0 | 73.0 | PR 2019 |
| Mask TextSpotter [paper] | 69.0 | 55.0 | 61.3 | ECCV 2018 |
| TextNet [paper] | 68.2 | 59.5 | 63.5 | ACCV 2018 |
| Textboxes [paper] | 62.1 | 45.5 | 52.5 | AAAI 2017 |
| EAST [paper] | 50.0 | 36.2 | 42.0 | CVPR 2017 |
| Baseline [paper] | 33.0 | 40.0 | 36.0 | ICDAR 2017 |
| SegLink [paper] | 30.3 | 23.8 | 26.7 | CVPR 2017 |
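For context, in the DetEval protocol tr is a threshold on area recall (how much of a ground-truth polygon is covered by the detection) and tp is a threshold on area precision (how much of the detection lies on the ground truth); a one-to-one match requires both. The sketch below is only a minimal Python illustration of that per-pair test, simplified to the one-to-one case and written with shapely and made-up coordinates; it is not the repository's official MATLAB evaluation code.

```python
# Illustrative sketch only -- not the official Total-Text evaluation code.
# Requires: pip install shapely
from shapely.geometry import Polygon

def pair_matches(gt_pts, det_pts, tr=0.8, tp=0.4):
    """DetEval-style one-to-one test for a single GT/detection pair.

    tr: minimum fraction of the GT area covered by the detection (area recall)
    tp: minimum fraction of the detection area lying on the GT (area precision)
    """
    gt = Polygon(gt_pts)
    det = Polygon(det_pts)
    inter = gt.intersection(det).area
    area_recall = inter / gt.area if gt.area > 0 else 0.0
    area_precision = inter / det.area if det.area > 0 else 0.0
    return area_recall >= tr and area_precision >= tp

# Hypothetical example: a tight detection around a 10 x 2 ground-truth word box.
gt_poly  = [(0, 0), (10, 0), (10, 2), (0, 2)]
det_poly = [(0.5, 0), (10, 0), (10, 2), (0.5, 2)]
print(pair_matches(gt_poly, det_poly, tr=0.8, tp=0.4))  # True: high recall and precision
print(pair_matches(gt_poly, det_poly, tr=0.7, tp=0.6))  # True: a tight box passes both settings
```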

Hi there, we believe that most of the works in the table you referred to use the default values, tr = 0.8 and tp = 0.4, apart from FTSN, which uses the Pascal VOC IoU metric.

We are currently asking the authors (in the table) to send us their detection output so we can evaluate their results with tr = 0.7 and tp = 0.6 (which we found are better values in terms of discouraging methods with loose detection boxes).

FYI, we are currently updating the table with our re-evaluation. However, we can't guarantee when it will be done, since we haven't received all the authors' replies yet. Hope this helps.

When I use tr = 0.8 and tp = 0.4, I found that if I expand the boundary of the detection polygons, the score becomes much better, which is not consistent with the visual quality. Can you check your code for this situation? Or I can send you two different results to compare.

I uploaded my results to https://pan.baidu.com/s/16S66fcY9cPYm2LY7s3ovlg
(code = 9xku).
My results are below (tested with your official MATLAB code).

1. no_expand
   - tr = 0.8, tp = 0.4: Recall = 74.113, Precision = 81.710, F-score = 77.727
   - tr = 0.7, tp = 0.6: Recall = 78.901, Precision = 85.724, F-score = 82.171
2. expand
   - tr = 0.8, tp = 0.4: Recall = 80.816, Precision = 88.434, F-score = 84.454
   - tr = 0.7, tp = 0.6: Recall = 51.578, Precision = 57.197, F-score = 54.242

This is exactly the reason why we proposed the new threshold values. We found this in our experiments as well: the old values are too loose for our tight polygon ground-truth format, and the new thresholds are meant to discourage loose bounding-box predictions. Thank you for your valuable example and for sharing your findings on Baidu Cloud.
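As a concrete illustration of this effect (with hypothetical polygons, not the detections from this issue): a detection that fully covers the ground truth but is noticeably larger keeps area recall at 1.0, so it still matches under tr = 0.8, tp = 0.4 as long as its area-precision overlap stays above 0.4, yet the same detection is rejected once tp = 0.6. A short shapely sketch of that comparison:

```python
# Illustrative sketch only -- hypothetical polygons, not the results uploaded in this issue.
from shapely.geometry import Polygon

gt = Polygon([(0, 0), (10, 0), (10, 2), (0, 2)])                 # 10 x 2 ground-truth word box, area 20
loose = Polygon([(-1, -0.5), (11, -0.5), (11, 2.5), (-1, 2.5)])  # expanded detection, area 36

inter = gt.intersection(loose).area   # 20 (GT is fully covered)
area_recall = inter / gt.area         # 1.0
area_precision = inter / loose.area   # ~0.56

print(area_recall >= 0.8 and area_precision >= 0.4)  # True  -> matches under the old thresholds
print(area_recall >= 0.7 and area_precision >= 0.6)  # False -> rejected under the new thresholds
```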

If you are concerned about inconsistency in your comparison (i.e., different sets of thresholds used by other methods), we suggest you include both results in your manuscript and explain them accordingly. We will update our comparison table soon (with 0.7 and 0.6), since these are now the official values for Total-Text.