Some Questions and Comments

Question

tom99763 opened this issue 2 years ago

Have you considered using a vector-quantized autoencoder (VQVAE) instead of the CNN feature map in future work? I think the results could be surprising, given its feature compression and sampleable properties for image-to-image translation tasks.
It seems that the input-output pixel correlation strongly affects the translation result early in training (as in multimodal translation or animal-to-human translation). Instead of predicting everything at once, a two-stage model (first contour, then texture) may improve the result.

Thank you

Taesung Park · Answer 1 · Wed Dec 21 2022 05:37:36 GMT+0800 (China Standard Time)

Hello, thanks for the suggestions.

I think incorporating VQVAE can be a good direction, particularly for saving compute.
A two-stage model may help, especially if we go to higher resolution. But two-stage approaches are also more cumbersome to train.
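
For readers unfamiliar with the "feature compression" mentioned in the question, here is a minimal sketch of the vector-quantization step used in VQVAE-style models. It is not code from this repository: the function names (`quantize`, `dequantize`), the toy codebook, and the feature values are all hypothetical, and a real model would learn the codebook and operate on encoder outputs rather than hand-written lists.

```python
# Hypothetical illustration of vector quantization, not this project's code.
# Each continuous feature vector is replaced by the index of its nearest
# codebook entry, so a feature map becomes a grid of small integers.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Return, for each feature vector, the index of the nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

def dequantize(indices, codebook):
    """Look the indices back up to recover the quantized vectors."""
    return [codebook[i] for i in indices]

# Toy values chosen for illustration only (assumptions, not real model data).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 0.8]]

indices = quantize(features, codebook)
reconstructed = dequantize(indices, codebook)
print(indices)
print(reconstructed)
```

Storing one integer per location instead of a full feature vector is where the compute saving would come from, and the discrete indices are what make the latent space convenient to sample.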