CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

Reproduction problem while training inpainting model

AlonzoLeeeooo opened this issue · comments

Thanks for the good work. I am trying to reproduce the diffusion model on the image inpainting task. The configuration file I use is modified from models/ldm/inpainting_big/config.yaml, but the loss curve appears to be quite weird: it converges too fast, right after the warmup ends.
[image: training loss curve]

(Note that the number of warmup steps is 1000. The loss has already reached a pretty low value at 1000 steps.)

Also, the inpainting results are poor in quality. This is one of my own test samples (the model is trained on the FFHQ dataset).

[image: inpainting result on an FFHQ test sample]

Has anyone encountered the same problem? I feel like this might be caused by a learning rate issue. Please help me figure this out. Thank you very much!

Could you give me an example of your dataloader?
I am using the same config file, giving the masked image as 3 channels and the image as 3 channels as well, but I am getting this error:

RuntimeError: Given groups=1, weight of size [256, 7, 3, 3], expected input[4, 6, 64, 64] to have 7 channels, but got 6 channels instead

You need to modify the corresponding part in ddpm.py. I solved the problem by concatenating the mask, masked image, and image, so that the input has the 7 channels the configuration file specifies. However, it still seems hard to reproduce the official result. I have no idea how long the authors trained the model for; I have trained for 3 entire days and the inpainted results are still blurry.

Do you mean that in the dataloader, masked_image will contain masked_image, mask, and image?
Or if you mean ddpm.py, could you specify where, please?

I fixed my problem, and after training I got the same output as you, just noise in the masked parts.

Hi guys, same problem here.

Hi, could you tell me where to change? Thx a lot!

I suggest you spend some time understanding the whole codebase. It would be a lot easier if you understand how Stable Diffusion works and how this process is implemented, although it might take a while.

Generally speaking, you could modify ldm/models/diffusion/ddpm.py according to the script scripts/inpaint.py, which is used at inference time. At line 79 of inpaint.py we can see that the mask in each batch is downsampled to the same size as the masked image after it has passed through the VQ model. This is implemented with torch.nn.functional.interpolate(), and the result is concatenated with the encoded masked image. We should keep the same way of adding the mask during training. So the number of input channels of the whole U-Net should be 7 (image, 3 channels + masked image, 3 channels + mask, 1 channel = 7 channels), and the mask and masked image should be concatenated with the input image in the same way as at inference. With that, modifying the corresponding lines in ddpm.py should work.
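
For what it's worth, here is a minimal sketch of that channel layout, based on the conditioning code in scripts/inpaint.py (the helper name build_inpainting_input and the channel-first tensor shapes are my own assumptions, not code from this repo):

import torch
import torch.nn.functional as F

def build_inpainting_input(model, x_noisy, mask, masked_image):
    # x_noisy:      noisy image latent from the first-stage encoder, (B, 3, h, w)
    # mask:         binary mask, (B, 1, H, W), same normalization as at inference
    # masked_image: image with the masked region removed, (B, 3, H, W)
    c = model.cond_stage_model.encode(masked_image)   # encoded masked image: 3 latent channels
    cc = F.interpolate(mask, size=c.shape[-2:])       # downsample the mask to the latent resolution: 1 channel
    c = torch.cat((c, cc), dim=1)                     # 4 conditioning channels
    return torch.cat((x_noisy, c), dim=1)             # 3 + 4 = 7 U-Net input channels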

Thanks for your reply, I've solved the problem already.

Hi @AlonzoLeeeooo,

Any update on your progress? Were you able to achieve good inpainting results on your custom dataset? If so, it would be great if you could share your training pipeline/configurations.

Hi @zaryabmakram ,

I didn't manage to re-train the diffusion model. The results are always blurry, even after the model is trained for 3 days. Empirically, this is caused by insufficient training. Afterwards, I noticed that the GPU requirement reported in the supplementary materials of Stable Diffusion is 8 V100 GPUs. Due to limited computational resources, I had to give up on reproducing it.

How about fine-tuning the provided inpainting_big checkpoint instead of training from scratch? Have you experimented with that? Do you think that might produce good results on a custom dataset?

Also, are you aware of which dataset the inpainting_big checkpoint has been trained on?

I haven't tried fine-tuning yet, but the idea should work in theory. The reported training set is Places2 Standard. It is worth mentioning that the provided inpainting_big checkpoint can produce plausible results on most natural images. Maybe you could try it out.

I see, thanks! Well, I'll look into how I can try finetuning the inpainting checkpoint.

Can you kindly point me to the reference reporting that the Places2 Standard dataset has been used for the inpainting model training? I'm unable to find that.

It is in Table 15 of their supplementary materials, which you can find at https://openaccess.thecvf.com/content/CVPR2022/supplemental/Rombach_High-Resolution_Image_Synthesis_CVPR_2022_supplemental.pdf.

Could you please tell me what encoder you use as cond_stage_config for training the inpainting model?

Hi! I'm very interested in your training details.
You trained for 3 days: what batch size did you use, how much GPU memory did it take in total, and how many epochs did you train? I haven't tried reproducing it yet, but I'm curious about the claim in the latent-diffusion paper that it can reduce memory cost. Can it really reduce memory usage by working in the latent space?
Looking forward to your reply!

Hi @DongyangHuLi ,

I set the batch size to 48, but to save GPU memory I reduced the model channels to 128. I trained for roughly 600k iterations on two 3090 GPUs.
As for latent-diffusion's claim of reduced memory cost, that is relative to DDPM; the actual GPU requirement is still considerable. For the inpainting setting, it still takes 8 Tesla V100s to train, so my own compute is not enough to reproduce the results of the original paper. Hope this helps as a reference.

Thanks! So without enough hardware resources, it's really hard to make diffusion models work 😔

Yes, it's basically research you can only do by burning money 🙍‍♂️

Could you share the inpainting training code? Thank you!

Bro, are you also at USTC? Want to connect on WeChat? My WeChat ID is Kiss_The_Rain8. Please add me, I'd like to ask you some questions.

Hi @AlonzoLeeeooo
Do I need to finetune the autoencoder separately (stage 1) on my custom dataset and then finetune the inpainting_big model by modifying the input in ddpm.py as in inpaint.py (stage 2) on my custom dataset? Or only Stage 2 would work. Please help.

Hi @rayush7 ,
As far as I know, you don't need to tune the parameters of the VQ model (stage 1). Since the official one is trained on the Open Images dataset, it should be sufficient to encode most images. Fine-tuning only stage 2 should work.

Thank you @AlonzoLeeeooo
I will give it a try.

How should the data be prepared for inpainting?

The paper mentions that the data preparation step is the same as in LaMa.
https://github.com/advimman/lama

Could I have a look at your modified code for this part? Thank you very much!

Hi @mumingerlai ,

I would really like to help, but since I have been working on another project based on the same codebase, a huge number of modifications have been made to it, and it would be quite difficult for me to retrieve the parts corresponding to inpainting. Sorry that I cannot do you the favor.

But if there is any other problem with the modification, please feel free to discuss it in this issue and I will try my best to recall and answer.

Regards,
Chang

I'm very happy with your reply. I have modified inpainting.py and concatenated the images, masked images, and masks. It seems to run normally! Anyway, thank you very much!

Hello, could I have a look at your modification and your data config for inpainting training? These tasks are difficult for me. Thank you very much!

@AlonzoLeeeooo
Hi, during training, is your diffusion process applied only to the image latent features, or to the whole [image+masked_image+mask] feature maps? Thanks!

Hi @shensongli ,
The configuration file is the same as the official one. You can find it in the same folder as the downloaded inpainting_big model weights. Additionally, you may need to write a dataset.py and modify the data section of the .yaml configuration file.

Here is an example of my implementation. I follow the free-form mask setting in DeepFill v2. Hope this helps!

import os
import cv2
import numpy as np
from omegaconf import OmegaConf
from PIL import Image

import torch
from torch.utils.data import Dataset

class InpaintingTrain(Dataset):
    def __init__(self, size, data_root, config=None):
        # size: target resolution; data_root: a txt file with one image path per line
        self.size = size
        self.config = config or OmegaConf.create()
        self.image_flist = self.get_files_from_txt(data_root)

    def generate_stroke_mask(self, im_size, parts=4, maxVertex=25, maxLength=80, maxBrushWidth=40, maxAngle=360):
        # union of several random free-form strokes, following DeepFill-v2
        mask = np.zeros((im_size[0], im_size[1], 1), dtype=np.float32)
        for i in range(parts):
            mask = mask + self.np_free_form_mask(maxVertex, maxLength, maxBrushWidth, maxAngle, im_size[0], im_size[1])
        mask = np.minimum(mask, 1.0)
        return mask

    def np_free_form_mask(self, maxVertex, maxLength, maxBrushWidth, maxAngle, h, w):
        # draw a single random polyline with circular joints onto an empty (h, w, 1) canvas
        mask = np.zeros((h, w, 1), np.float32)
        numVertex = np.random.randint(maxVertex + 1)
        startY = np.random.randint(h)
        startX = np.random.randint(w)
        brushWidth = 0
        for i in range(numVertex):
            angle = np.random.randint(maxAngle + 1)
            angle = angle / 360.0 * 2 * np.pi
            if i % 2 == 0:
                angle = 2 * np.pi - angle
            length = np.random.randint(maxLength + 1)
            brushWidth = np.random.randint(10, maxBrushWidth + 1) // 2 * 2
            nextY = startY + length * np.cos(angle)
            nextX = startX + length * np.sin(angle)
            # np.int has been removed from recent numpy versions; use the builtin int instead
            nextY = np.maximum(np.minimum(nextY, h - 1), 0).astype(int)
            nextX = np.maximum(np.minimum(nextX, w - 1), 0).astype(int)
            cv2.line(mask, (startY, startX), (nextY, nextX), 1, brushWidth)
            cv2.circle(mask, (startY, startX), brushWidth // 2, 2)
            startY, startX = nextY, nextX
        cv2.circle(mask, (startY, startX), brushWidth // 2, 2)
        return mask

    def get_files_from_txt(self, path):
        # read one image path per line
        with open(path) as f:
            file_list = [line.strip("\n") for line in f.readlines()]
        return file_list

    def get_files(self, path):
        # walk a folder recursively and return the complete file paths
        ret = []
        for root, dirs, files in os.walk(path):
            for filespath in files:
                ret.append(os.path.join(root, filespath))
        return ret

    def __len__(self):
        return len(self.image_flist)

    def __getitem__(self, i):
        # load and resize the image, scale to [0, 1]
        image = np.array(Image.open(self.image_flist[i]).convert("RGB"))
        image = cv2.resize(image, (self.size, self.size))
        image = image.astype(np.float32) / 255.0
        image = torch.from_numpy(image)

        # random free-form mask, binarized to {0, 1}
        mask = self.generate_stroke_mask([self.size, self.size])
        mask[mask < 0.5] = 0
        mask[mask >= 0.5] = 1
        mask = torch.from_numpy(mask)

        # masked image: zero out the masked region
        masked_image = (1 - mask) * image

        # rescale everything to [-1, 1], matching make_batch() in scripts/inpaint.py
        batch = {"image": image, "mask": mask, "masked_image": masked_image}
        for k in batch:
            batch[k] = batch[k] * 2.0 - 1.0

        return batch
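
For reference, a quick sanity check of this dataset class could look like the snippet below (just a rough example; "train_images.txt" is a placeholder for a text file with one image path per line, and in an actual run you would instead point the data section of the .yaml config at this class):

from torch.utils.data import DataLoader

dataset = InpaintingTrain(size=256, data_root="train_images.txt")
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch = next(iter(loader))
print(batch["image"].shape, batch["masked_image"].shape, batch["mask"].shape)
# expected: (4, 256, 256, 3), (4, 256, 256, 3), (4, 256, 256, 1), all values in [-1, 1]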

Regards,
Chang

Hi @wtliao ,

I think the diffusion process is applied only to the image latent, the same as in their official SD v2.0 inpainting model. I guess both the mask and the masked image can be regarded as a strong condition?
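
To make that concrete, here is a rough sketch of a training step under this design (not the exact code in this repo; q_sample follows the interface in ldm/models/diffusion/ddpm.py, and the 7-channel concatenation follows the discussion above):

import torch
import torch.nn.functional as F

def inpainting_training_step(model, z, c, num_timesteps=1000):
    # z: clean image latent, (B, 3, h, w); c: [encoded masked image, downsampled mask], (B, 4, h, w)
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device).long()
    noise = torch.randn_like(z)
    z_noisy = model.q_sample(x_start=z, t=t, noise=noise)   # noise is added to the image latent only
    # the conditioning c stays clean and is concatenated along the channel dimension
    eps_pred = model.model.diffusion_model(torch.cat([z_noisy, c], dim=1), t)
    return F.mse_loss(eps_pred, noise)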

Regards
Chang

@AlonzoLeeeooo
Thanks for the explanation. Would it be convenient to connect on WeChat to discuss? My WeChat ID is wtliao.

I'm very happy with your reply. Thank you very much!

Hello, could you show me the specific modified code? I've just started with deep learning and have been reading the DDPM code for a long time, but I don't know how to add this. Thank you very much!

I can't add your WeChat ID.

Hi, I'd like to ask how this loss curve was obtained. Is there a built-in way, or do I need to write my own callback function?

For the loss function, you should just be able to use the diffusion model's MSE loss, right? What exactly do you mean by a callback function?

I may not have expressed myself clearly. I wanted to ask whether there is a simple way to display this plot directly after training ends, or whether I can only log the result of each epoch. I'm not sure whether it is something like the wandb logging in main.py.

You mean the training curve, right? The Stable Diffusion codebase logs the loss to TensorBoard by default. As for wandb, I'm not sure whether the official code implements it; if not, you may need to modify ddpm.py yourself and add wandb.log().
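
If you do want wandb on top of the default logging, the kind of change meant here might look roughly like this (a sketch, not a verbatim patch; it assumes wandb.init() has been called elsewhere, e.g. in main.py, and that training_step in ddpm.py keeps its usual Lightning shape):

# inside the diffusion LightningModule in ldm/models/diffusion/ddpm.py -- sketch only
import wandb

def training_step(self, batch, batch_idx):
    loss, loss_dict = self.shared_step(batch)
    # existing Lightning logging (picked up by the TensorBoard logger by default)
    self.log_dict(loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True)
    # extra manual wandb logging of the same scalars
    wandb.log({k: float(v) for k, v in loss_dict.items()}, step=self.global_step)
    return loss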

@AlonzoLeeeooo Hi, I'd like to ask: when training inpainting, did you set an unconditional probability, i.e. setting the text to empty with some probability (10%) so that the model keeps its unconditional ability? Also, what learning rate is appropriate, and should scale_lr be set to false? My training results are very poor.

This should be set. The reason is that classifier-free guidance can then be used at sampling time, and classifier-free guidance has been observed to help diffusion model sampling a lot. The learning rate needs tuning and probably depends on your training data; you can also refer to the hyperparameter settings in the supplementary materials of the official paper (if I remember correctly, the experiments were done on Places2). As for scale_lr, if I remember correctly it averages the lr over the number of GPUs you use for training. How long did you train? It might simply be undertrained; inpainting may need more than a week of training (I haven't tried it myself; one reference point is that training a model for unconditional generation on CelebA faces takes about 4 days).
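
As a rough illustration of the 10% unconditional probability mentioned above (a sketch under my own assumptions; the function name and the place it is applied are not from this repo, but get_learned_conditioning is the usual LDM text-encoding entry point):

import random

def drop_text_condition(prompts, p_uncond=0.1):
    # with probability p_uncond, replace the caption with an empty string so the model
    # also learns the unconditional prediction needed for classifier-free guidance
    return ["" if random.random() < p_uncond else p for p in prompts]

# usage sketch, before encoding the captions in the training step:
# captions = drop_text_condition(captions)
# c = model.get_learned_conditioning(captions)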

@AlonzoLeeeooo Thanks for the answer. I used 64 GPUs with a batch size of 4 per GPU and trained for about 200k iterations, roughly a week. A few problems appeared: with scale_lr=True, training collapsed shortly after the start and the output was pure noise. Another problem is that the more iterations I train, the simpler the generated background becomes, which I cannot figure out. I am using the official stable-diffusion code; the U-Net input has 9 channels and the autoencoder output has 4 channels.
You mentioned that scale_lr averages the lr over the number of GPUs, but from the code it looks like it is directly lr * batchsize * #GPUs. So the original lr of 5e-5 becomes 1.28e-3 after scaling, and the model seems a bit sensitive to this. Also, during fine-tuning, is CLIP fine-tuned together with the same lr?

Was the training done on Places2? Diffusion models are data-hungry, so a dataset that is too small may not work well. In principle this setup should already give you a preliminary result, so I suggest checking for bugs first. I misremembered how scale_lr works, but 1.28e-3 may indeed be too large, and it would not be surprising for training to diverge with it. You shouldn't need CLIP for training inpainting, right? Are you using the inpainting_big config file?

I am using the Stable Diffusion v2 model, and the inpainting input includes text. I trained on the LAION 4.5 dataset.

@AlonzoLeeeooo Hello, in classifier-free guidance sampling, do you know how the unconditional_conditioning parameter should be set? If it is None (in ddim.py), adjusting only unconditional_guidance_scale does not seem to have any effect, see the figure below:
[image]
Also, according to the latent-imagenet-diffusion.ipynb reference, the class condition is set directly to a class index beyond the real classes and then turned into an embedding that is passed as unconditional_conditioning. But I'm not sure what to do when the condition is an image or text?
[image]

Hi, if this is set to 1, classifier-free guidance is not used by default; in theory, the larger it is, the better the sample quality, and if I remember correctly the official default is 7.5. For image conditions I'm not sure; you could look at some img2img works. For text, you should set the unconditional conditioning to empty strings [''].
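
For reference, a rough sketch of how this is usually wired up with the Stable-Diffusion-style DDIM sampler (based on the txt2img script; the prompt, latent shape, and step count are placeholders, and the exact call may differ for inpainting):

from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)  # assumes a loaded LatentDiffusion model
batch_size = 4

uc = model.get_learned_conditioning(batch_size * [""])                   # unconditional branch: empty prompts
c = model.get_learned_conditioning(batch_size * ["a photo of a cat"])    # conditional branch: example prompt

samples, _ = sampler.sample(S=50, conditioning=c, batch_size=batch_size,
                            shape=[4, 64, 64], verbose=False,
                            unconditional_guidance_scale=7.5,   # 1.0 disables classifier-free guidance
                            unconditional_conditioning=uc)
# internally the sampler combines the two predictions as:
#   eps = eps_uncond + scale * (eps_cond - eps_uncond)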