xinntao / SFTGAN

CVPR18 - Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform

Evaluate the reconstruction effect of the GAN method

Marshall-yao opened this issue

Hi Xintao. After reading the SFTGAN, ESRGAN, and RankSRGAN papers, I would like to discuss with you how to evaluate the reconstruction quality of GAN-based methods.

1) SFTGAN uses user studies to evaluate reconstruction quality. This is less convincing than objective evaluation criteria, and may be questioned by reviewers.

2) ESRGAN reports PSNR and SSIM on standard test sets, and the reported results are very high, setting new records. This clearly demonstrates the effectiveness of the method and is more convincing to reviewers.

3) RankSRGAN uses NIQE and other evaluation metrics that are better suited to GAN-based methods.

If I want to use SFTGAN as the baseline (for running-time considerations), then based on the above, should I use NIQE for evaluation?
Is subjective evaluation necessary? Are there other evaluation methods?

Best regards.

Evaluating perceptual SR methods is still difficult. There are several options:

  1. PSNR/SSIM: not suitable for perceptual methods. Although some papers report these metrics for perceptual-oriented SR methods, they serve only as references and do not represent the actual performance. (BTW, the ESRGAN paper has two variants: one trained with only L2/L1 loss, for the best PSNR; the other trained with VGG and GAN losses, for perceptual quality. I think the PSNR/SSIM numbers you mentioned come from the PSNR-oriented model.)
  2. Perceptual index (PI)/NIQE/Ma's score. These do better than PSNR/SSIM and were used in the PIRM18 SR competition. PI correlates well with human opinion scores on a coarse scale, but not always on a finer scale, which highlights the urgent need for better perceptual quality metrics. (More analysis can be found in Sec. 4.1 and Sec. 5 of the PIRM18-SR Challenge report: https://arxiv.org/pdf/1809.07517.pdf.)
    So PI/NIQE/Ma are better metrics than PSNR/SSIM, but they are still not good enough.
  3. Human evaluation / user study. This is the last resort. It has its own drawbacks, such as non-reproducibility, human bias, etc.
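For concreteness, the simpler metrics above can be sketched in plain NumPy. This is only an illustration, not the reference implementations: the SSIM here is a simplified global variant (the standard one uses an 11x11 Gaussian sliding window), and `perceptual_index` just encodes the PI = 0.5 * ((10 - Ma) + NIQE) combination defined in the PIRM18 report; the Ma and NIQE scores themselves come from separate learned/statistical models not shown here.

```python
import numpy as np

def psnr(ref, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB (higher = closer to the reference)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim_global(ref, test, data_range=255.0):
    """Simplified *global* SSIM, one window over the whole image.
    Only a rough approximation of the standard sliding-window SSIM."""
    x, y = ref.astype(np.float64), test.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def perceptual_index(ma_score, niqe):
    """PI as defined for the PIRM18-SR challenge (lower is better);
    ma_score and niqe must be computed by their own models."""
    return 0.5 * ((10.0 - ma_score) + niqe)
```

Note that both psnr and ssim_global are full-reference metrics (they need the ground-truth image), whereas NIQE/Ma/PI are no-reference, which is what makes them usable on real-world images without ground truth.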

So there is still no single faithful metric for perceptual SR. Usually we examine all of these metrics, and each gives a partial view of the algorithm's performance.
In practice, during training I usually visualize some 'typical' regions of selected images. Though this introduces bias, it gives a direct sense of whether the model is good or not.
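That region-visualization practice can be sketched as follows, assuming the two model outputs are NumPy image arrays; the function names and region coordinates here are hypothetical placeholders for crops you would pick by eye (hair, grass, and other texture-heavy areas):

```python
import numpy as np

# Hand-picked (row, col, size) crops of texture-heavy regions.
# These coordinates are illustrative, not from any real image.
REGIONS = [(10, 10, 16), (40, 60, 16)]

def crop_patches(image, regions):
    """Extract square patches from one image (an H x W [x C] array)."""
    return [image[r:r + s, c:c + s] for r, c, s in regions]

def side_by_side(out_a, out_b, regions):
    """Tile matching patches from two model outputs for visual comparison:
    each row is one region, with model A on the left and model B on the right."""
    rows = [np.concatenate([pa, pb], axis=1)
            for pa, pb in zip(crop_patches(out_a, regions),
                              crop_patches(out_b, regions))]
    return np.concatenate(rows, axis=0)
```

The resulting tiled array can then be saved or displayed at each validation step, so the same regions are compared across training checkpoints.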

Thanks so much for your patient reply.

  1. I have heard of evaluating a GAN method's performance by classifying its results.
    What do you think of this approach?

  2. Regarding "visualize some 'typical' regions of selected images": which images do you usually
    choose, and which regions of the image do you select?
    For example, are the selected regions hair, grass, beard, and similar areas?

  3. There is a disadvantage to judging a model's performance by visualizing its results:
    if the visual difference between the improved method and the original method is not obvious,
    it is difficult to tell which model is better.

    I think, in this case, one should judge by PSNR or the perceptual index.
    What is your opinion on this problem?

Thanks so much.

  1. It is also a valid choice for measuring performance.
  2. As you said, hair/grass/building/plant textures. These are usually the regions that are difficult to restore.
  3. Sometimes PSNR or the perceptual index does not reflect perceptual quality the way humans do.
    In the real world, e.g., companies developing algorithms for mobile phones also evaluate algorithms by visualization. So it may still be a necessary step.