imlixinyang / HiSD

Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement" (CVPR 2021 Oral).


Questions about the setting of the Training Phases

asheroin opened this issue · comments

According to your paper and training code, three image-generation phases (raw / self / random style code) are used. However, it is easy to add another similar phase, such as a random style image: randomly pick another image as the input of the style encoder and then follow the same data flow as the random-style-code phase. Some other works (e.g., https://github.com/saic-mdal/HiDT) have included such a phase in their training and obtained satisfying results.

So I am just wondering: have you tried this phase before? What were the results like, and why did you not add it to your paper?
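Roughly, the extra phase I mean would look something like the sketch below. The interface names (gen.encode / gen.extract / gen.translate / gen.decode, dis.calc_gen_loss, and the arguments y, i, j) are placeholders for illustration, not the actual HiSD training code.

```python
# Hypothetical reference-styled phase: extract a style from a random reference
# image and reuse the same data flow as the random-style-code phase.
import torch

def reference_guided_phase(gen, dis, x, x_ref, y, i, j):
    """Translate x within tag i / attribute j using a style extracted from x_ref."""
    e = gen.encode(x)                   # content features of the input image
    s_ref = gen.extract(x_ref, i)       # style code extracted from a random reference image
    x_trg = gen.decode(gen.translate(e, s_ref, i))

    # Adversarial loss on the reference-styled output, mirroring the latent-guided phase.
    loss_gen_adv = dis.calc_gen_loss(x_trg, y, i, j)

    # Style reconstruction: the extractor should recover the reference style from the output.
    loss_sty = torch.mean(torch.abs(gen.extract(x_trg, i) - s_ref))
    return loss_gen_adv, loss_sty
```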

As you can see, even without the path you mentioned, HiSD works fine with the extracted style at inference time. This is because of the cycle-translation path: in the cycle-back phase, the image is manipulated by the extracted style, and we add an adversarial objective for the cycle-back image as well (you can find the significance of this objective in the ablation study). In an ideal situation, an extra phase that uses the extracted style to guide the translation seems unnecessary.
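For clarity, the cycle-translation path described here amounts to something like the following sketch (again, the gen/dis interface is a placeholder standing in for the actual HiSD generator and discriminator):

```python
# Cycle-translation sketch: translate with a random style code, then cycle back
# using the style extracted from the original image, with an adversarial term
# on the cycle-back image too.
import torch

def cycle_translation_phase(gen, dis, x, y, i, j, j_trg, style_dim=256):
    e = gen.encode(x)

    # Forward translation guided by a randomly sampled style code.
    z = torch.randn(x.size(0), style_dim, device=x.device)
    s_trg = gen.map(z, i, j_trg)
    x_trg = gen.decode(gen.translate(e, s_trg, i))

    # Cycle back using the style extracted from the original image.
    s_org = gen.extract(x, i)
    x_cyc = gen.decode(gen.translate(gen.encode(x_trg), s_org, i))

    loss_adv_trg = dis.calc_gen_loss(x_trg, y, i, j_trg)   # adversarial on the translated image
    loss_adv_cyc = dis.calc_gen_loss(x_cyc, y, i, j)       # adversarial on the cycle-back image
    loss_rec = torch.mean(torch.abs(x_cyc - x))            # cycle reconstruction
    return loss_adv_trg, loss_adv_cyc, loss_rec
```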

However, I agree that using the style extracted from a reference image during training could further enhance the ability of the extractor and stabilize training in the late iterations. The extractor tends to cheat and fail to extract the style if you train the model for a long time.

I will definitely try this later and update the code if it helps (if it doesn't help, I will report the result in this issue). But still, you're welcome to have a try and edit the code yourself.

Thank you for your suggestion.

I have tried the experiment of adding another random reference image on the AFHQ dataset when training the generator and discriminator, and I find that adding it helps the reference-guided translation.

When running the model on the AFHQ dataset, the diversity of HiSD is limited compared to other models such as StarGAN-v2 and DRIT++. Adding another random reference image can slightly help the generation of the reference-guided branch.

@HelenMao Thank you for sharing this result.
I mostly care about whether this operation can stabilize the late training, and whether the diversity comes from the manipulation of the background or the color. Do you have any idea about this?

I think this operation can stabilize the training of the extractor on the AFHQ dataset, but I am not sure about the CelebA dataset.
On the AFHQ dataset, the extractor easily suffers mode collapse, and using this operation can slightly alleviate the issue.

But as you mentioned in your reply in #19 (comment), HiSD focuses only on manipulating the shape and maintains the background and color. That is really interesting and I have not figured out why, since other frameworks change the background and can generate diverse textures. I once thought it was because of the mask operation: the HiSD translator adds the original features to the transformed features through the attention mask. However, I failed to learn the mask in my own framework; in my experiment, the results of my framework still change the background and produce diverse textures even when directly copying your attention module from the HiSD model.
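(For readers of this thread, the masked-combination idea being discussed is roughly the following; this is a simplified stand-in, not the actual HiSD translator implementation.)

```python
# Simplified sketch: translated features replace the original ones only where a
# learned attention mask is active, so untouched regions (e.g., background) pass
# through unchanged.
import torch
import torch.nn as nn

class MaskedTranslate(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Predict a single-channel attention mask in [0, 1] from the transformed features.
        self.to_mask = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, e_org, e_trans):
        # e_org:   features of the input image
        # e_trans: features after the style-conditioned transformation
        m = self.to_mask(e_trans)
        return m * e_trans + (1.0 - m) * e_org
```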

I think a diversity loss such as the mode seeking loss may have some influence, but I am not sure. I would like to copy your attention module into StarGAN-v2 to see whether it makes a difference.
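(For reference, a common form of the mode seeking / diversification loss from MSGAN and DRIT++ mentioned here is sketched below; this is a generic formulation, not code from HiSD or StarGAN-v2.)

```python
# Mode seeking loss: encourage two outputs generated from different latent codes
# to differ, relative to the distance between the codes.
import torch

def mode_seeking_loss(x1, x2, z1, z2, eps=1e-5):
    """x1, x2: images generated from latent codes z1, z2 for the same input image."""
    d_img = torch.mean(torch.abs(x1 - x2))
    d_z = torch.mean(torch.abs(z1 - z2))
    # Minimizing this term maximizes image diversity per unit of latent distance.
    return 1.0 / (d_img / (d_z + eps) + eps)
```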

I have also calculated the FID of randomly generated results on the AFHQ dataset, and I find that adding an adversarial loss on random-style images largely improves the results.
Maybe you can try this to see whether it brings much improvement on your CelebA dataset?

Thank you for this information; I will try it later. As you noticed, if there is no early stop, after around 300k iterations (fewer tags mean fewer iterations) the extractor will suffer mode collapse. It would be very helpful if this operation can improve that.

The mode seeking loss (diversification loss) may indeed influence the disentanglement, although I like that simple but effective idea very much. A reviewer also asked why we do not add this loss. I replied that "In our setting, the gains from diversifying small objects (e.g., glasses) are far less than the gains from diversifying background colors. Therefore, we think that it may cause the manipulation of global information and aggravate the mode collapse of small objects." But I'm not sure; this is just my guess.


Have you tried using only the guided image to generate the style vector? I think it's more natural to get the style code from an existing image than to generate it from a random style vector as in StyleGAN. Using a random style code makes this work more like a generation task rather than a transfer task.

Some works focus only on the reference-guided task, but both tasks are necessary for practical use. Imagine you want to add glasses to an input picture: the target is determined, but you may not have a reference image, so you can directly sample the style code from a simple prior distribution. Therefore, the reference-guided task is customized while the latent-guided task is convenient. It is still exactly a transfer task because the output is based on the input (e.g., identity), no matter which kind of style code is used.
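The two inference modes discussed in this thread amount to something like the sketch below. The gen.map / gen.extract / gen.translate / gen.encode / gen.decode interface is a placeholder for illustration, not the exact HiSD generator API.

```python
# Latent-guided vs. reference-guided inference.
import torch

@torch.no_grad()
def latent_guided(gen, x, i, j, style_dim=256):
    # No reference image: sample a style code from a simple prior.
    z = torch.randn(x.size(0), style_dim, device=x.device)
    s = gen.map(z, i, j)
    return gen.decode(gen.translate(gen.encode(x), s, i))

@torch.no_grad()
def reference_guided(gen, x, x_ref, i):
    # Reference image available: extract its style and reuse it.
    s = gen.extract(x_ref, i)
    return gen.decode(gen.translate(gen.encode(x), s, i))
```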