TencentARC / MasaCtrl

[ICCV 2023] Consistent Image Synthesis and Editing

Home Page: https://ljzycmd.github.io/projects/MasaCtrl/

SD1.5 and SD2.1 not suitable for DDIM inversion

shouwangzhe200 opened this issue

I tried the code in playground_real.ipynb with SD1.4, SD1.5, and SD2.1, and found that the DDIM inversion process only reconstructs images consistent with the original for SD1.4. For SD1.5, the reconstructed images deviate significantly, and for SD2.1, the reconstructions collapse completely.

Hi @shouwangzhe200, for failures of DDIM inversion, you can increase the success rate as follows:

  1. Increase the inversion steps. In our script, the default number of inversion steps is only 50, so you can use more steps (e.g., 500) to obtain satisfying results; this may alleviate the poor reconstruction quality with the SD1.5 model (see the sketch after this list).
  2. As for the collapse on SD2.1, be careful about the prediction_type of the U-Net: the checkpoint in the Hugging Face repo here is trained with velocity prediction (prediction_type="v_prediction" rather than epsilon), while the implementation in our repo relies on epsilon prediction. Please make sure you use their released model here, which is trained with prediction_type="epsilon".
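
As a rough illustration (not the exact code in playground_real.ipynb), the diffusers DDIMInverseScheduler can run the inversion loop with a larger step count. The model id, step count, and helper name below are placeholders; adapt them to your own setup.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMInverseScheduler

# Illustrative checkpoint; use whichever SD1.5 weights you are testing.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Build the inverse scheduler from the same config and use more steps than the default 50.
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
inverse_scheduler.set_timesteps(500)

@torch.no_grad()
def ddim_invert(latents, prompt_embeds):
    # `latents` are the VAE-encoded (and scaled) latents of the real image;
    # `prompt_embeds` are the text-encoder embeddings of the source prompt.
    for t in inverse_scheduler.timesteps:
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = inverse_scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```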

Even with the above suggestions, DDIM inversion may still reconstruct the image with poor quality in some cases. Follow-up works such as Null-text Inversion can further alleviate this; you can refer to #30 (comment) for more details.

Hope this can help you.

Thank you for the very detailed explanation; it has been very helpful. Another question: why does SD1.4 need only 50 steps for inversion, while SD1.5 needs 500?

Hi @shouwangzhe200, in my view that is not a general rule; the behavior is case-specific and uneven. Some images can be reconstructed well on SD1.5 with only 50 inversion steps, while many images still cannot be inverted successfully on SD1.4 even with 500 steps. In general, more denoising steps help obtain higher reconstruction quality.

@ljzycmd Thank you for your great work! May I ask whether inversion requires a model trained with epsilon prediction, or can it also be applied to a model trained with v_prediction?

Hi @LiuShiyu95, the prediction type used during inversion should match the prediction type the model was trained with. You can use the DDIMInverseScheduler for the inversion: https://huggingface.co/docs/diffusers/api/schedulers/ddim_inverse.
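
For example, a minimal way to keep the two in sync (the model id below is only illustrative) is to read the prediction type from the checkpoint's scheduler config and build the inverse scheduler from that same config:

```python
from diffusers import StableDiffusionPipeline, DDIMInverseScheduler

# Illustrative checkpoint id; use the model you actually want to invert.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")

# The training-time prediction type is stored in the scheduler config.
print(pipe.scheduler.config.prediction_type)  # e.g. "epsilon" or "v_prediction"

# Building the inverse scheduler from the same config keeps the inversion
# consistent with how the checkpoint was trained.
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
```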

Hope the above can help you.

I understand, thanks very much!