CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

Reproduction problem while training inpainting model

AlonzoLeeeooo opened this issue · comments

Thanks for the good work. I am trying to reproduce the diffusion model on the image inpainting task. The configuration file I use is modified from models/ldm/inpainting_big/config.yaml, but the loss curve appears to be quite weird: it converges too fast, right after the warmup ends.
[image: training loss curve]

(Note that the number of warmup steps is 1000. The loss has already reached a pretty low value at 1000 steps.)

Also, the inpainting results are poor in quality. This is one of my own test samples (the model is trained on the FFHQ dataset).

[image: inpainting result on an FFHQ test sample]

Has anyone encountered the same problem? I feel like this might be caused by a learning rate issue. Please help me figure this out. Thank you very much!

Could you give me an example of your dataloader?
I am using the same config file, giving the masked image as 3 channels and the image as 3 channels as well, but I am getting this error:

RuntimeError: Given groups=1, weight of size [256, 7, 3, 3], expected input[4, 6, 64, 64] to have 7 channels, but got 6 channels instead

You need to modify the corresponding part in ddpm.py. I solved the problem by concatenating the mask, masked image, and image, so that the input has the 7 channels the configuration file specifies. However, it still seems hard to reproduce the official result. I have no idea how long the authors trained the model for; I have trained for 3 entire days and the inpainted results are still blurry.

Do you mean that in the dataloader, masked_image will contain masked_image, mask, and image?
Or if you mean ddpm.py, could you specify where, please?

I fixed my problem, and after training I got the same output as you, just noise in the masked parts.

Hi guys, same problem here.

Hi, could you tell me where to change? Thx a lot!

I suggest you spend some time understanding the whole codebase. It would be a lot easier if you understand how Stable Diffusion works and how this process is implemented, although it might take a while.

Generally speaking, you could modify ldm/models/diffusion/ddpm.py according to the script scripts/inpaint.py, which is used at inference time. At line 79 of inpaint.py we can see that the mask in each batch is downsampled to the same size as the masked image after it has passed through the VQ model. This is implemented with torch.nn.functional.interpolate(), and the result is concatenated with the encoded masked image. We should keep the same way of adding the mask during training. So the number of input channels of the whole U-Net should be 7 (image, 3 channels + masked image, 3 channels + mask, 1 channel = 7 channels), and the mask and masked image should be concatenated with the input image in the same way as at inference. With that, modifying the corresponding lines in ddpm.py should work.
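
For what it's worth, here is a minimal sketch of that channel layout, based on the conditioning code in scripts/inpaint.py (the helper name build_inpainting_input and the channel-first tensor shapes are my own assumptions, not code from this repo):

import torch
import torch.nn.functional as F

def build_inpainting_input(model, x_noisy, mask, masked_image):
    # x_noisy:      noisy image latent from the first-stage encoder, (B, 3, h, w)
    # mask:         binary mask, (B, 1, H, W), same normalization as at inference
    # masked_image: image with the masked region removed, (B, 3, H, W)
    c = model.cond_stage_model.encode(masked_image)   # encoded masked image: 3 latent channels
    cc = F.interpolate(mask, size=c.shape[-2:])       # downsample the mask to the latent resolution: 1 channel
    c = torch.cat((c, cc), dim=1)                     # 4 conditioning channels
    return torch.cat((x_noisy, c), dim=1)             # 3 + 4 = 7 U-Net input channels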

Thanks for your reply, I've solved the problem already.

Hi @AlonzoLeeeooo,

Any update on your progress? Were you able to achieve good inpainting results on your custom dataset? If so, it would be great if you could share your training pipeline/configurations.

Hi @zaryabmakram ,

I didn't manage to re-train the diffusion model. The results are always blurry, even after the model is trained for 3 days. Empirically, this is caused by insufficient training. Afterwards, I noticed that the GPU requirement reported in the supplementary materials of Stable Diffusion is 8 V100 GPUs. Due to limited computational resources, I had to give up on reproducing it.

How about fine-tuning the provided inpainting_big checkpoint instead of training from scratch? Have you experimented with that? Do you think that might produce good results on a custom dataset?

Also, are you aware of which dataset the inpainting_big checkpoint has been trained on?

I haven't tried fine-tuning yet, but the idea should work in theory. The reported training set is Places2 Standard. It is worth mentioning that the provided inpainting_big checkpoint can produce plausible results on most natural images. Maybe you could try it out.

I see, thanks! Well, I'll look into how I can try finetuning the inpainting checkpoint.

Can you kindly point me to the reference reporting that the Places2 Standard dataset has been used for the inpainting model training? I'm unable to find that.

It is in Table 15 of their supplementary materials, which you can find at https://openaccess.thecvf.com/content/CVPR2022/supplemental/Rombach_High-Resolution_Image_Synthesis_CVPR_2022_supplemental.pdf.

Could you please tell me what encoder you use as cond_stage_config for training the inpainting model?

Hi! I'm very interested in your training details.
You trained for 3 days: what batch size did you use, how much GPU memory did it take in total, and how many epochs did you train? I haven't tried reproducing it yet, but I'm curious about the claim in the latent-diffusion paper that it can reduce memory cost. Can it really reduce memory usage by working in the latent space?
Looking forward to your reply!

Hi @DongyangHuLi ,

I set the batch size to 48, but to save GPU memory I reduced the model channels to 128. I trained for roughly 600k iterations on two 3090 GPUs.
As for latent-diffusion's claim of reduced memory cost, that is relative to DDPM; the actual GPU requirement is still considerable. For the inpainting setting, it still takes 8 Tesla V100s to train, so my own compute is not enough to reproduce the results of the original paper. Hope this helps as a reference.

Thanks! So without enough hardware resources, it's really hard to make diffusion models work 😔

Yes, it's basically research you can only do by burning money 🙍‍♂️

Could you share the inpainting training code? Thank you!

Bro, are you also at USTC? Want to connect on WeChat? My WeChat ID is Kiss_The_Rain8. Please add me, I'd like to ask you some questions.

Hi @AlonzoLeeeooo
Do I need to finetune the autoencoder separately (stage 1) on my custom dataset and then finetune the inpainting_big model by modifying the input in ddpm.py as in inpaint.py (stage 2) on my custom dataset? Or only Stage 2 would work. Please help.

Hi @rayush7 ,
As far as I know, you don't need to tune the parameters of the VQ model (stage 1). Since the official one is trained on the Open Images dataset, it should be sufficient to encode most images. Fine-tuning only stage 2 should work.

Thank you @AlonzoLeeeooo
I will give it a try.

How should the data be prepared for inpainting?

The paper mentions that the data preparation step is the same as in LaMa.
https://github.com/advimman/lama

Could I have a look at your modified code for this part? Thank you very much!

Hi @mumingerlai ,

I would really like to help, but since I have been working on another project based on the same codebase, a huge number of modifications have been made to it, and it would be quite difficult for me to retrieve the parts corresponding to inpainting. Sorry that I cannot do you the favor.

But if there is any other problem with the modification, please feel free to discuss it in this issue and I will try my best to recall and answer.

Regards,
Chang

I'm very happy with your reply. I have modified inpainting.py and concatenated the images, masked images, and masks. It seems to run normally! Anyway, thank you very much!

Hello, could I have a look at your modification and your data config for inpainting training? These tasks are difficult for me. Thank you very much!

@AlonzoLeeeooo
Hi, during training, is your diffusion process applied only to the image latent features, or to the whole [image+masked_image+mask] feature maps? Thanks!

Hi @shensongli ,
The configuration file is the same as the official one. You can find it in the same folder as the downloaded inpainting_big model weights. Additionally, you may need to write a dataset.py and modify the data section of the .yaml configuration file.

Here is an example of my implementation. I follow the free-form mask setting in DeepFill v2. Hope this helps!

import os
import cv2
import numpy as np
from omegaconf import OmegaConf
from PIL import Image

import torch
from torch.utils.data import Dataset

class InpaintingTrain(Dataset):
    def __init__(self, size, data_root, config=None):
        # size: target resolution; data_root: a txt file with one image path per line
        self.size = size
        self.config = config or OmegaConf.create()
        self.image_flist = self.get_files_from_txt(data_root)

    def generate_stroke_mask(self, im_size, parts=4, maxVertex=25, maxLength=80, maxBrushWidth=40, maxAngle=360):
        # union of several random free-form strokes, following DeepFill-v2
        mask = np.zeros((im_size[0], im_size[1], 1), dtype=np.float32)
        for i in range(parts):
            mask = mask + self.np_free_form_mask(maxVertex, maxLength, maxBrushWidth, maxAngle, im_size[0], im_size[1])
        mask = np.minimum(mask, 1.0)
        return mask

    def np_free_form_mask(self, maxVertex, maxLength, maxBrushWidth, maxAngle, h, w):
        # draw a single random polyline with circular joints onto an empty (h, w, 1) canvas
        mask = np.zeros((h, w, 1), np.float32)
        numVertex = np.random.randint(maxVertex + 1)
        startY = np.random.randint(h)
        startX = np.random.randint(w)
        brushWidth = 0
        for i in range(numVertex):
            angle = np.random.randint(maxAngle + 1)
            angle = angle / 360.0 * 2 * np.pi
            if i % 2 == 0:
                angle = 2 * np.pi - angle
            length = np.random.randint(maxLength + 1)
            brushWidth = np.random.randint(10, maxBrushWidth + 1) // 2 * 2
            nextY = startY + length * np.cos(angle)
            nextX = startX + length * np.sin(angle)
            # np.int has been removed from recent numpy versions; use the builtin int instead
            nextY = np.maximum(np.minimum(nextY, h - 1), 0).astype(int)
            nextX = np.maximum(np.minimum(nextX, w - 1), 0).astype(int)
            cv2.line(mask, (startY, startX), (nextY, nextX), 1, brushWidth)
            cv2.circle(mask, (startY, startX), brushWidth // 2, 2)
            startY, startX = nextY, nextX
        cv2.circle(mask, (startY, startX), brushWidth // 2, 2)
        return mask

    def get_files_from_txt(self, path):
        # read one image path per line
        with open(path) as f:
            file_list = [line.strip("\n") for line in f.readlines()]
        return file_list

    def get_files(self, path):
        # walk a folder recursively and return the complete file paths
        ret = []
        for root, dirs, files in os.walk(path):
            for filespath in files:
                ret.append(os.path.join(root, filespath))
        return ret

    def __len__(self):
        return len(self.image_flist)

    def __getitem__(self, i):
        # load and resize the image, scale to [0, 1]
        image = np.array(Image.open(self.image_flist[i]).convert("RGB"))
        image = cv2.resize(image, (self.size, self.size))
        image = image.astype(np.float32) / 255.0
        image = torch.from_numpy(image)

        # random free-form mask, binarized to {0, 1}
        mask = self.generate_stroke_mask([self.size, self.size])
        mask[mask < 0.5] = 0
        mask[mask >= 0.5] = 1
        mask = torch.from_numpy(mask)

        # masked image: zero out the masked region
        masked_image = (1 - mask) * image

        # rescale everything to [-1, 1], matching make_batch() in scripts/inpaint.py
        batch = {"image": image, "mask": mask, "masked_image": masked_image}
        for k in batch:
            batch[k] = batch[k] * 2.0 - 1.0

        return batch
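
For reference, a quick sanity check of this dataset class could look like the snippet below (just a rough example; "train_images.txt" is a placeholder for a text file with one image path per line, and in an actual run you would instead point the data section of the .yaml config at this class):

from torch.utils.data import DataLoader

dataset = InpaintingTrain(size=256, data_root="train_images.txt")
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch = next(iter(loader))
print(batch["image"].shape, batch["masked_image"].shape, batch["mask"].shape)
# expected: (4, 256, 256, 3), (4, 256, 256, 3), (4, 256, 256, 1), all values in [-1, 1]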

Regards,
Chang

Hi @wtliao ,

I think the diffusion process is applied only to the image latent, the same as in their official SD v2.0 inpainting model. I guess both the mask and the masked image can be regarded as a strong condition?
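
To make that concrete, here is a rough sketch of a training step under this design (not the exact code in this repo; q_sample follows the interface in ldm/models/diffusion/ddpm.py, and the 7-channel concatenation follows the discussion above):

import torch
import torch.nn.functional as F

def inpainting_training_step(model, z, c, num_timesteps=1000):
    # z: clean image latent, (B, 3, h, w); c: [encoded masked image, downsampled mask], (B, 4, h, w)
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device).long()
    noise = torch.randn_like(z)
    z_noisy = model.q_sample(x_start=z, t=t, noise=noise)   # noise is added to the image latent only
    # the conditioning c stays clean and is concatenated along the channel dimension
    eps_pred = model.model.diffusion_model(torch.cat([z_noisy, c], dim=1), t)
    return F.mse_loss(eps_pred, noise)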

Regards
Chang

@AlonzoLeeeooo
Thanks for the explanation. Would it be convenient to connect on WeChat to discuss? My WeChat ID is wtliao.

I'm very happy with your reply. Thank you very much!

Hello, could you show me the specific modified code? I've just started with deep learning and have been reading the DDPM code for a long time, but I don't know how to add this. Thank you very much!

I can't add your WeChat ID.

Hi, I'd like to ask how this loss curve was obtained. Is there a built-in way, or do I need to write my own callback function?

For the loss function, you should just be able to use the diffusion model's MSE loss, right? What exactly do you mean by a callback function?

I may not have expressed myself clearly. I wanted to ask whether there is a simple way to display this plot directly after training ends, or whether I can only log the result of each epoch. I'm not sure whether it is something like the wandb logging in main.py.

You mean the training curve, right? The Stable Diffusion codebase logs the loss to TensorBoard by default. As for wandb, I'm not sure whether the official code implements it; if not, you may need to modify ddpm.py yourself and add wandb.log().
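
If you do want wandb on top of the default logging, the kind of change meant here might look roughly like this (a sketch, not a verbatim patch; it assumes wandb.init() has been called elsewhere, e.g. in main.py, and that training_step in ddpm.py keeps its usual Lightning shape):

# inside the diffusion LightningModule in ldm/models/diffusion/ddpm.py -- sketch only
import wandb

def training_step(self, batch, batch_idx):
    loss, loss_dict = self.shared_step(batch)
    # existing Lightning logging (picked up by the TensorBoard logger by default)
    self.log_dict(loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True)
    # extra manual wandb logging of the same scalars
    wandb.log({k: float(v) for k, v in loss_dict.items()}, step=self.global_step)
    return loss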

@AlonzoLeeeooo Hi, I'd like to ask: when training inpainting, did you set an unconditional probability, i.e. setting the text to empty with some probability (10%) so that the model keeps its unconditional ability? Also, what learning rate is appropriate, and should scale_lr be set to false? My training results are very poor.

This should be set. The reason is that classifier-free guidance can then be used at sampling time, and classifier-free guidance has been observed to help diffusion model sampling a lot. The learning rate needs tuning and probably depends on your training data; you can also refer to the hyperparameter settings in the supplementary materials of the official paper (if I remember correctly, the experiments were done on Places2). As for scale_lr, if I remember correctly it averages the lr over the number of GPUs you use for training. How long did you train? It might simply be undertrained; inpainting may need more than a week of training (I haven't tried it myself; one reference point is that training a model for unconditional generation on CelebA faces takes about 4 days).
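
As a rough illustration of the 10% unconditional probability mentioned above (a sketch under my own assumptions; the function name and the place it is applied are not from this repo, but get_learned_conditioning is the usual LDM text-encoding entry point):

import random

def drop_text_condition(prompts, p_uncond=0.1):
    # with probability p_uncond, replace the caption with an empty string so the model
    # also learns the unconditional prediction needed for classifier-free guidance
    return ["" if random.random() < p_uncond else p for p in prompts]

# usage sketch, before encoding the captions in the training step:
# captions = drop_text_condition(captions)
# c = model.get_learned_conditioning(captions)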

@AlonzoLeeeooo Thanks for the answer. I used 64 GPUs with a batch size of 4 per GPU and trained for about 200k iterations, roughly a week. A few problems appeared: with scale_lr=True, training collapsed shortly after the start and the output was pure noise. Another problem is that the more iterations I train, the simpler the generated background becomes, which I cannot figure out. I am using the official stable-diffusion code; the U-Net input has 9 channels and the autoencoder output has 4 channels.
You mentioned that scale_lr averages the lr over the number of GPUs, but from the code it looks like it is directly lr * batchsize * #GPUs. So the original lr of 5e-5 becomes 1.28e-3 after scaling, and the model seems a bit sensitive to this. Also, during fine-tuning, is CLIP fine-tuned together with the same lr?

Was the training done on Places2? Diffusion models are data-hungry, so a dataset that is too small may not work well. In principle this setup should already give you a preliminary result, so I suggest checking for bugs first. I misremembered how scale_lr works, but 1.28e-3 may indeed be too large, and it would not be surprising for training to diverge with it. You shouldn't need CLIP for training inpainting, right? Are you using the inpainting_big config file?

I am using the Stable Diffusion v2 model, and the inpainting input includes text. I trained on the LAION 4.5 dataset.

@AlonzoLeeeooo Hello, in classifier-free guidance sampling, do you know how the unconditional_conditioning parameter should be set? If it is None (in ddim.py), adjusting only unconditional_guidance_scale does not seem to have any effect, see the figure below:
[image]
Also, according to the latent-imagenet-diffusion.ipynb reference, the class condition is set directly to a class index beyond the real classes and then turned into an embedding that is passed as unconditional_conditioning. But I'm not sure what to do when the condition is an image or text?
[image]

Hi, if this is set to 1, classifier-free guidance is not used by default; in theory, the larger it is, the better the sample quality, and if I remember correctly the official default is 7.5. For image conditions I'm not sure; you could look at some img2img works. For text, you should set the unconditional conditioning to empty strings [''].
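
For reference, a rough sketch of how this is usually wired up with the Stable-Diffusion-style DDIM sampler (based on the txt2img script; the prompt, latent shape, and step count are placeholders, and the exact call may differ for inpainting):

from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)  # assumes a loaded LatentDiffusion model
batch_size = 4

uc = model.get_learned_conditioning(batch_size * [""])                   # unconditional branch: empty prompts
c = model.get_learned_conditioning(batch_size * ["a photo of a cat"])    # conditional branch: example prompt

samples, _ = sampler.sample(S=50, conditioning=c, batch_size=batch_size,
                            shape=[4, 64, 64], verbose=False,
                            unconditional_guidance_scale=7.5,   # 1.0 disables classifier-free guidance
                            unconditional_conditioning=uc)
# internally the sampler combines the two predictions as:
#   eps = eps_uncond + scale * (eps_cond - eps_uncond)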