junyongyou / triq

TRIQ implementation


Does transformer really help?

vztu opened this issue · comments

Hi @junyongyou, I noticed that your triq model has a total of 23M parameters, most of which are from ResNet50. In this sense, Transformer layers are just like an FC head. The transformer layers you used (with parameters (2, 32, 8, 64)) even have fewer parameters than the projection head used in Koncept512.

So I am wondering how much the transformer actually helps over using an FC head. Do you have the standard train-test results on CLIVE and KonIQ such that I can easily compare with other SoTAs? Thank you very much.
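For reference, a minimal sketch (in tf.keras, not the repo's code) of the parameter-count comparison behind this question: a tiny Transformer encoder head configured as I read "(2, 32, 8, 64)" (2 layers, d_model 32, 8 heads, MLP dim 64) versus a plain FC projection head on 2048-d ResNet50 features. The FC head sizes below are illustrative, not Koncept512's exact ones.

```python
# Sketch only: compare parameter counts of a small Transformer encoder head
# vs. a plain FC projection head. Config values are assumptions, not TRIQ's code.
import tensorflow as tf

D_MODEL, HEADS, MLP_DIM, LAYERS = 32, 8, 64, 2   # my reading of "(2, 32, 8, 64)"
FEAT_DIM = 2048                                   # ResNet50 feature channels

def transformer_head():
    # Tiny pre-built Transformer encoder operating on d_model-dim tokens.
    inp = tf.keras.Input(shape=(None, D_MODEL))
    x = inp
    for _ in range(LAYERS):
        attn = tf.keras.layers.MultiHeadAttention(
            num_heads=HEADS, key_dim=D_MODEL // HEADS)(x, x)
        x = tf.keras.layers.LayerNormalization()(x + attn)
        mlp = tf.keras.layers.Dense(MLP_DIM, activation="relu")(x)
        mlp = tf.keras.layers.Dense(D_MODEL)(mlp)
        x = tf.keras.layers.LayerNormalization()(x + mlp)
    out = tf.keras.layers.Dense(1)(x[:, 0])       # regress quality from first token
    return tf.keras.Model(inp, out)

def fc_head():
    # Illustrative FC projection head on pooled 2048-d features.
    inp = tf.keras.Input(shape=(FEAT_DIM,))
    x = tf.keras.layers.Dense(1024, activation="relu")(inp)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    out = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model(inp, out)

print("transformer head params:", transformer_head().count_params())  # ~tens of thousands
print("fc head params:", fc_head().count_params())                    # ~millions
```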

Hi, you raised an interesting question. To be honest (and no offense), I don't think Koncept512 is a very good approach. IQA has its own particularities compared to image recognition. I also don't think simply counting model parameters is an appropriate way to judge a model's performance.

I didn't do experiments with other backbones in TRIQ. However, I have done some work in another paper and found that the choice of backbone is not crucial to performance; e.g., VGG16 even performed better than ResNet50. I would assume the same applies to TRIQ.

If I have time later (or you could also do it), I will try different backbones, e.g., VGG16 or ResNet18... I also think a dedicated backbone with a simple architecture could be even better. However, the backbone should be pretrained on a large-scale database, e.g., ImageNet, as there are no large-scale IQA databases yet.
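A minimal sketch of what such a backbone swap could look like, assuming tf.keras.applications as the source of ImageNet-pretrained backbones (the repo's own model builder may differ):

```python
# Sketch only: swap the ImageNet-pretrained backbone that feeds a quality head.
import tensorflow as tf

def build_backbone(name: str):
    # include_top=False returns feature maps; weights="imagenet" because no
    # large-scale IQA database is available for pretraining.
    if name == "resnet50":
        return tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
    if name == "vgg16":
        return tf.keras.applications.VGG16(include_top=False, weights="imagenet")
    raise ValueError(f"unknown backbone: {name}")

for name in ("resnet50", "vgg16"):
    bb = build_backbone(name)
    print(name, "params:", bb.count_params(), "output shape:", bb.output_shape)
```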

I have only trained TRIQ on the combined set of KonIQ and CLIVE, as reported in the paper. The model was then tested on SPAQ. More recently, I have also trained TRIQ on SPAQ and FLIVE. I have not compared against other SOTAs, but I would assume a conclusion similar to the paper's can be drawn.

Another experiment would be to feed features extracted by a pretrained CNN directly into the Transformer, meaning the backbone is not trained at all. We can also freeze the backbone in TRIQ to see how it works. In my previous experience, this tends to produce slightly worse performance.
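A minimal sketch of that frozen-backbone variant, assuming tf.keras, an illustrative fixed input resolution, and a small Dense head as a stand-in for the Transformer head (names here are not the repo's API):

```python
# Sketch only: keep the pretrained CNN fixed and train only the head on top.
import tensorflow as tf

backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(384, 512, 3))
backbone.trainable = False  # freeze: features come from the pretrained CNN only

inputs = tf.keras.Input(shape=(384, 512, 3))
feats = backbone(inputs, training=False)              # frozen feature maps
x = tf.keras.layers.GlobalAveragePooling2D()(feats)   # pool to a 2048-d vector
x = tf.keras.layers.Dense(64, activation="relu")(x)   # stand-in for the Transformer head
score = tf.keras.layers.Dense(1)(x)                   # predicted quality score

model = tf.keras.Model(inputs, score)
model.compile(optimizer="adam", loss="mse")
print("trainable params:",
      sum(int(tf.size(w)) for w in model.trainable_weights))  # head only
```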

Great, thanks for your clarification!