myungsub / CAIN

Source code for AAAI 2020 paper "Channel Attention Is All You Need for Video Frame Interpolation"

About Pixel Shuffle

chautuankien opened this issue · comments

It is very interesting that you use Pixel Shuffle and Channel Attention to capture motion without estimating optical flow.

In the paper, you said that Pixel Shuffle is used to maintain a large receptive field, so I want to ask how PS can do that.

One more question: in VFI, I usually see that people reuse the input images to reconstruct the colors of the intermediate frame. How can you synthesize the middle frame just by applying Up Shuffle?

Thank you.

Hi @chautuankien, thanks for your interest in our work.

In the paper, you said that Pixel Shuffle is used to maintain a large receptive field, so I want to ask how PS can do that.

PixelShuffle, applied in the inverse ("down-shuffle") direction, trades spatial resolution (H x W) for channel depth (C), so a convolution with the same kernel size covers a larger region of the original frame.

For instance, if you apply a 3x3(xC) kernel to an H x W x C feature map, the receptive field is just 3x3. But if you "down-shuffle" the data to H/2 x W/2 x 4C and apply a 3x3(x4C) kernel, the receptive field becomes twice as large in each spatial dimension.
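To make this concrete, here is a minimal PyTorch sketch (not the actual CAIN code; the shapes and layer sizes are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)       # feature map: C x H x W = 3 x 64 x 64

down = nn.PixelUnshuffle(2)         # "down-shuffle": H x W x C -> H/2 x W/2 x 4C
up = nn.PixelShuffle(2)             # "up-shuffle": the exact inverse

y = down(x)                         # (1, 12, 32, 32)
conv = nn.Conv2d(12, 12, kernel_size=3, padding=1)
z = conv(y)                         # each 3x3 window here spans a 6x6 region of x

out = up(z)                         # back to (1, 3, 64, 64)
print(y.shape, out.shape)
```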

One more question: in VFI, I usually see that people reuse the input images to reconstruct the colors of the intermediate frame. How can you synthesize the middle frame just by applying Up Shuffle?

From what I've understood, I think you're talking about optical-flow-based models that warp the input images. Our model focuses on direct synthesis without flow-based warping, so the approach is very different. There are pros and cons to each method, but flow-based works are more popular these days, to be frank.

Thank you so much for your reply.

So, for the first question, does PS work like a pooling layer? For example, max pooling with stride 2 picks the maximum value in each 2x2 grid, down-sampling H x W to H/2 x W/2, so the receptive field becomes twice as large. Is that the same idea? (A small sketch of what I mean is below.)
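Here is a quick PyTorch comparison of the two operations (shapes are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)

pooled = nn.MaxPool2d(2)(x)         # (1, 3, 32, 32): keeps only the max of each 2x2 grid
shuffled = nn.PixelUnshuffle(2)(x)  # (1, 12, 32, 32): keeps all values, moved into channels

# Unlike pooling, the down-shuffle is lossless and invertible:
print(torch.equal(nn.PixelShuffle(2)(shuffled), x))  # True
```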

For the second question, from what I've understood, your method is CNN-based, right? You use CNNs to directly synthesize the intermediate frame.

Another question: why did you choose to down-shuffle only once and not multiple times, like in an encoder-decoder network where pooling layers down-sample the data repeatedly?