JingyunLiang / SwinIR

SwinIR: Image Restoration Using Swin Transformer (official repository)

Home Page: https://arxiv.org/abs/2108.10257

Why can SwinIR be tested directly (not patch by patch) on images of arbitrary sizes?

jiaaihhy opened this issue · comments

As far as I know, a Transformer's input must have a fixed resolution, so at test time an overlapping-patch scheme is usually used. In your code, I would like to know which method you use and the idea behind it. It looks like an image of any resolution can be fed into SwinIR? How is that done?
Looking forward to your reply, thank you!

You're actually right. Since the positional encoding is fixed after training, the attention matrix size is fixed for all Transformers, as far as I know.

  • For patch-based Transformers (e.g., ViT and IPT), the attention matrix is computed among different small image patches: fixed patch size (e.g., 4x4) + fixed attention matrix size (e.g., ((48/4)^2)x((48/4)^2)) --> fixed input size (e.g., 48x48).
  • For the Swin Transformer-based SwinIR, the attention matrix is computed among different pixels within a small window:
    fixed pixel size (1x1) + fixed attention matrix size (e.g., (8^2)x(8^2)) --> fixed input size (e.g., 8x8).

However, one key difference between them is that the Swin Transformer uses the same attention module for all non-overlapping 8x8 windows (similar to an 8x8 convolution with stride 8). It can therefore be applied easily to any image whose size is a multiple of 8 (8x8, 16x16, 24x24, etc.).
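
To make the "same attention module on every non-overlapping window" idea concrete, here is a minimal sketch of window partitioning in PyTorch. The function name `window_partition` and the (N, H, W, C) layout follow the usual Swin Transformer convention, but treat this as an illustration rather than the exact SwinIR code.

```python
import torch

def window_partition(x, window_size=8):
    """Split (N, H, W, C) into non-overlapping windows of shape
    (num_windows * N, window_size, window_size, C).
    H and W are assumed to be multiples of window_size."""
    n, h, w, c = x.shape
    x = x.view(n, h // window_size, window_size, w // window_size, window_size, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, c)

# Any H and W that are multiples of 8 work: the same attention module
# is applied to every 8x8 window, so the input size is not fixed.
x = torch.randn(1, 24, 40, 3)
print(window_partition(x, 8).shape)  # torch.Size([15, 8, 8, 3])
```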

In practice, given any testing image, we can pad it to be a multiple of 8 and test it with SwinIR. See Lines 56-63 for the padding code.

# pad input image to be a multiple of window_size
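
For reference, here is a sketch of that padding-and-cropping logic, assuming a PyTorch tensor of shape (N, C, H, W); variable names may differ slightly from the repository's test script. The image is mirror-padded on the bottom and right so its size becomes a multiple of window_size, and the output is cropped back afterwards.

```python
import torch

def test_with_padding(model, img_lq, window_size=8, scale=4):
    """Pad img_lq (N, C, H, W) to a multiple of window_size by mirroring
    its bottom/right borders, run the model, then crop the output back
    to the original size times the scale factor."""
    _, _, h_old, w_old = img_lq.size()
    h_pad = (h_old // window_size + 1) * window_size - h_old
    w_pad = (w_old // window_size + 1) * window_size - w_old
    img_lq = torch.cat([img_lq, torch.flip(img_lq, [2])], 2)[:, :, :h_old + h_pad, :]
    img_lq = torch.cat([img_lq, torch.flip(img_lq, [3])], 3)[:, :, :w_old + w_pad, :]
    with torch.no_grad():
        output = model(img_lq)
    return output[..., :h_old * scale, :w_old * scale]
```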

One more question about the padding code: you use h_pad = (h_old // window_size + 1) * window_size - h_old, and this padding is confusing to me. For example, Set5's baby has a 128x128 low-resolution image, and 128/8 = 16 divides exactly, so why pad it? Thank you.

Yes, 128 happens to be a multiple of 8 for Set5's baby, but that is not true for other images. This is why we need padding.

As for the reason to pad baby.png (already a multiple of window_size), I think it can slightly improve the quality of the border areas. Maybe you can try not padding baby.png and compare the PSNR. Report it here if you find something interesting. Thank you.
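
As a quick check of the formula's behavior (my own arithmetic, not from the repository): it always pads by at least 1 and at most window_size pixels, so an already-divisible size such as 128 still receives a full extra window.

```python
def pad_amount(h_old, window_size=8):
    # extra rows/columns added so the padded size becomes the next multiple of window_size
    return (h_old // window_size + 1) * window_size - h_old

print(pad_amount(126))  # 2 -> padded to 128
print(pad_amount(128))  # 8 -> padded to 136 (a full extra window)
```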

commented

Hi, with a 48x48 patch used for training, tests can be done at any resolution; is this only possible with the Swin Transformer?
I understand that Swin does attention within a fixed 8x8 window, so it can be tested at any resolution. Is my understanding wrong?

Suppose I use a normal Transformer block that does attention over the whole picture. In this case, can I also feed in any resolution at test time? For example, if I use ViT with 48x48 patches for training and pass the image size dynamically in the forward pass, the model can run the test without reporting errors. Is this idea correct?

  • Your understanding of the Swin Transformer is correct. One difference from other Transformers is that it uses a small attention window (8x8).
  • Currently, most Transformer models take testing inputs of the same size as the training inputs. For example, IPT uses 48x48 patches in training and 48x48 patches in testing, so it has to use a sliding-window scheme at test time. This is slower and leads to border artifacts near the patch borders.
  • If you feed in training patches of dynamic sizes, it may be a problem for global-attention-based models (e.g., ViT and IPT), especially in image restoration (as opposed to image recognition), for the reasons below (see also the sketch after this list).
  1. Since the attention matrix size changes every time, it may be hard for the query, key and value to learn good representations. It is like having a dynamic receptive field in a CNN.
  2. Sometimes a pixel has to compete with the others in a 64x64 image (due to the softmax), and the next time with the others in a 1024x1024 image. In the second case it may be overwhelmed by many small values in the softmax.
  3. GPU memory becomes a limit when you train with large images.
  • You can try to train a model with your dynamic cropping strategy, but I think it will not outperform current models. For image restoration, the performance may even drop.
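
To illustrate the global-attention point with some numbers of my own (treating every pixel as one token, which is an assumption, not the thread's setup): the attention matrix has one row and one column per token, so both its size and the population inside each softmax grow quadratically with the image side length.

```python
# Illustrative only: attention-matrix sizes for global self-attention
# over all pixels of a square image.
for side in (48, 64, 1024):
    tokens = side * side
    # the attention matrix is tokens x tokens; each softmax runs over `tokens` values
    gb = tokens * tokens * 4 / 1e9  # fp32 storage per head
    print(f"{side}x{side} image -> {tokens} tokens -> "
          f"{tokens}x{tokens} attention matrix (~{gb:.2f} GB per head in fp32)")
```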
commented

Thank you, your answer clarified my thinking a lot, but there is one thing.
I do not feed images of different sizes during the training phase.
I use 48x48 patches in the training phase and 1024x720 images for testing when this is applied to IPT and ViT, and the model does not report errors because the image size is passed in dynamically, the same way you pass x_size to each module in your code.
Since the attention window in the training phase covers the whole 48x48 patch, can this seriously hurt restoration performance when testing on images of other, arbitrary resolutions? I have a feeling this is not the right way to go, but I am not sure.

Yes, I would expect a severe performance drop if you do so. Just test your idea on IPT and check the PSNR drop.

I am confused about why you changed window_size to 8 instead of 7.
You just pad the image to a multiple of window_size, while the Swin Transformer first pads the image to a multiple of patch_size. Is it because 8 is a multiple of 4 that you changed the window_size?

7 also works. I chose 8 because 6x8=48 and 8x8=64, which matches the 48x48 and 64x64 training patches commonly used in existing SR works and therefore allows a fair comparison.

Besides, 8x8 doesn't work for JPEG compression artifact reduction. One possible reason is that JPEG uses 8x8 blocks in encoding.

Thank you. For JPEG, the input image size is 126; it becomes 133 after padding, and then, after patch embedding, the patch resolution is 33. In window_partition it cannot be divided exactly, so maybe some information is lost. Am I correct? I see that the Swin Transformer for detection pads several times, so I am confused about why you pad like this.

No, we use 126x126 patches and a 7x7 window size for training JPEG CAR, so the number of windows is (126/7)x(126/7)=18x18. We don't use any padding inside the model. In testing, we pad the testing image to be a multiple of 7. See

# pad input image to be a multiple of window_size

Thanks a lot! I get it!

commented

Thanks for sharing this great work! In the padding code, you use h_pad = (h_old // window_size + 1) * window_size - h_old to process arbitrary images, but the output size of SwinIR is different from the input size. Is it possible to ensure that the input and output sizes are always the same, even when the input image size is not a multiple of 8?

No. SwinIR operates on small windows (8x8), so you always have to pad the input to a multiple of 8 at test time. After testing, you crop the output to the same size as the GT HR image. This operation has little impact on the final performance.
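
As a size check (my own example numbers, not from the thread): the pad-then-crop scheme means the final output always matches the original input size times the scale factor, even when the input is not a multiple of 8.

```python
# A 125x93 input (not a multiple of 8) is mirror-padded to 128x96 for the
# forward pass, and the SR output is cropped back afterwards.
h_old, w_old, window_size, scale = 125, 93, 8, 4
h_pad = (h_old // window_size + 1) * window_size - h_old  # 3
w_pad = (w_old // window_size + 1) * window_size - w_old  # 3
print(h_old + h_pad, w_old + w_pad)   # 128 96  (size fed to the network)
print(h_old * scale, w_old * scale)   # 500 372 (size after cropping the output)
```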

commented

Thanks a lot, I will try it.

Feel free to reopen it if you have more questions.

SwinIR solves the resolution-adaptivity problem of Transformers for low-level vision, which is great. However, the adopted window attention can only capture local interactions, which might restrict its model capacity.

We gently invite you to check out our MAXIM model, accepted as a CVPR 2022 Oral. It contains both global and local MLPs and can also be tested directly on images of arbitrary sizes. We evaluate on slightly different image restoration tasks -- denoising, deblurring, deraining, dehazing, and enhancement. Our code and models have been released at https://github.com/google-research/maxim