mr-eggplant / SAR

Code for ICLR 2023 paper (Oral) — Towards Stable Test-Time Adaptation in Dynamic Wild World

Normalization used for VitBase (LN)

george-conrad opened this issue

Hi,

thanks for sharing your work! Regarding the VitBase (LN) results, we noticed a mismatch between the normalization you are using and the one used by timm: the timm vision transformer uses a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). With these values, we got the following results for label_shifts, corresponding to Table 2 in your paper:

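| VitBase (LN), Acc@1 | Gauss. | Shot | Impul. | Defoc. | Glass | Motion | Zoom | Snow | Frost | Fog | Bright. | Contr. | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no_adapt | 46.9 | 47.7 | 46.9 | 42.9 | 34.2 | 50.8 | 44.8 | 57.0 | 52.5 | 56.5 | 76.1 | 31.9 | 46.6 | 65.5 | 66.1 |
| TENT | 58.4 | 59.9 | 59.8 | 58.9 | 56.6 | 62.5 | 59.4 | 66.8 | 24.5 | 70.9 | 79.1 | 63.1 | 65.7 | 73.8 | 71.7 |
| SAR | 59.1 | 60.3 | 60.5 | 59.3 | 57.7 | 62.9 | 59.9 | 67.6 | 66.5 | 70.8 | 79.2 | 63.7 | 66.5 | 74.0 | 71.8 |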
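
As a rough summary of the table above, the per-method mean over the 15 corruption types can be computed with a short Python sketch. The numbers are copied from the rows above; the names `results` and `mean_accuracy` are illustrative only and do not come from the SAR repository.

```python
# Per-corruption Acc@1 copied from the table above
# (VitBase (LN), label_shifts, evaluated with mean/std = 0.5).
# Variable and function names are illustrative, not taken from the SAR code.
results = {
    "no_adapt": [46.9, 47.7, 46.9, 42.9, 34.2, 50.8, 44.8, 57.0,
                 52.5, 56.5, 76.1, 31.9, 46.6, 65.5, 66.1],
    "TENT":     [58.4, 59.9, 59.8, 58.9, 56.6, 62.5, 59.4, 66.8,
                 24.5, 70.9, 79.1, 63.1, 65.7, 73.8, 71.7],
    "SAR":      [59.1, 60.3, 60.5, 59.3, 57.7, 62.9, 59.9, 67.6,
                 66.5, 70.8, 79.2, 63.7, 66.5, 74.0, 71.8],
}


def mean_accuracy(values):
    """Arithmetic mean over the corruption types."""
    return sum(values) / len(values)


averages = {method: mean_accuracy(vals) for method, vals in results.items()}

for method, avg in averages.items():
    print(f"{method:>8}: {avg:.1f}")
```

With these rows, the means come out at about 51.1 (no_adapt), 62.1 (TENT), and 65.3 (SAR).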

All the best,
George

Hi George, thank you for bringing up this issue! We will look into it and get back to you later. Apologies for the delayed response, as we are currently traveling in Africa. Thanks for understanding.

Hi George,

Thank you for your inquiry. We did not specifically set the normalization values and simply utilized the parameters provided by the ImageNet-C dataset.

Upon investigating the issue you raised, we found that using the [0.5, 0.5, 0.5] normalization values indeed results in higher accuracies for no_adapt, TENT, and our SAR. However, this does not affect our primary conclusions: 1) models with GN/LN are better suited for stable TTA than those with BN, and 2) GN/LN models do not always succeed and can still fail (e.g., on Frost, where TENT drops to 24.5 in your table). In this failure case, our SAR remains stable.

On the other hand, using the [0.485, 0.456, 0.406] normalization values can be regarded as a more severe distribution shift. Under this more challenging shift, the source accuracy (no_adapt, averaged over 15 corruptions at severity level 5) drops to 29.9%, while our SAR improves it to 56.3% (TENT reaches only 47.7%), showcasing its capability. We appreciate your insight, as it has prompted us to consider this matter more deeply in the context of stable TTA.
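
To illustrate why the choice of statistics acts like an extra shift, the sketch below applies both normalization settings from this thread to the same pixel in plain Python. The two means are taken from the discussion; the standard deviation (0.229, 0.224, 0.225) paired with the (0.485, 0.456, 0.406) mean is the usual ImageNet value and is an assumption here, since the thread does not state it. The function name `normalize` is illustrative.

```python
# Two normalization settings discussed in this thread.
TIMM_VIT_MEAN = (0.5, 0.5, 0.5)
TIMM_VIT_STD = (0.5, 0.5, 0.5)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
# Assumption: standard ImageNet std, not stated in the thread.
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize(pixel, mean, std):
    """Channel-wise (x - mean) / std for one RGB pixel with values in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(pixel, mean, std))


pixel = (1.0, 1.0, 1.0)  # a white pixel after scaling to [0, 1]

# Input the ViT sees when normalized with the 0.5 statistics.
with_half_stats = normalize(pixel, TIMM_VIT_MEAN, TIMM_VIT_STD)
# Input the same model sees when normalized with ImageNet statistics.
with_imagenet_stats = normalize(pixel, IMAGENET_MEAN, IMAGENET_STD)

print("0.5 stats:     ", with_half_stats)
print("ImageNet stats:", with_imagenet_stats)
```

With the 0.5 statistics every input lies in [-1, 1], whereas the ImageNet statistics map the same white pixel to values above 2 in each channel, outside that range. In practice, timm exposes the statistics a checkpoint expects (for example through `timm.data.resolve_data_config`), which is one way to avoid hard-coding them.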

Best regards,