SCEdit does not converge after 20k iterations on the COCO dataset.
serend1p1ty opened this issue
Sorry for the inconvenience this may have caused you. The problem could be that the learning rate is set too low, which might be affecting convergence. Our framework uses a batch learning rate to accommodate multi-GPU scenarios, calculated as "real_lr = yaml_lr * gpu_num * batch / 640". You can also see this in the log entries under "pg0_lr: 0.xxx". We plan to make adjustments to ensure better compatibility with this learning rate setting in the future.
For more information, please see: https://github.com/modelscope/scepter/blob/main/scepter/modules/solver/diffusion_solver.py#L136
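For illustration, here is a minimal sketch of how that batch learning rate works out in practice, based only on the formula quoted above (the function and variable names are illustrative, not the solver's actual attributes):

```python
def effective_lr(yaml_lr: float, gpu_num: int, batch_size: int) -> float:
    # Batch learning rate described above: real_lr = yaml_lr * gpu_num * batch / 640
    return yaml_lr * gpu_num * batch_size / 640

# Example (assumed setup): yaml_lr = 5e-5 on a single GPU with batch 16 yields a
# much smaller effective learning rate, which could explain slow convergence.
print(effective_lr(5e-5, gpu_num=1, batch_size=16))  # -> 1.25e-06
```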
@jiangzeyinzi Thanks for your reply. Can you tell me approximately what the final loss value should be?
My current value is 0.37. Is that normal?
In generative tasks, the loss often cannot serve as the primary indicator of model convergence. In our setup, the loss went from 0.16 to 0.14. The loss also varies noticeably with different data, base models, and condition types. In terms of results, under the settings in our paper (with a larger batch size), about 3k steps is generally enough for the generated results to be constrained by the conditional images.
Understood, I'll train a little more and see the results.
@jiangzeyinzi
After training for 50k steps, the model seems to converge.
For human subjects, however, generation still does not seem very stable.
I am a beginner in the field of text-to-image generation. Does the current result meet expectations?
Next, I plan to train SCEdit on the larger LAION dataset. Your paper mentions that you used a fixed learning rate of 5e-5. To reproduce the results in the paper, should I set the yaml learning rate to 0.000125 when the batch size is 256, since 0.000125 * 256 / 640 = 5e-5?
Looking forward to your reply.
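(For reference, a small sketch of the arithmetic behind that question, simply inverting the batch learning rate formula quoted earlier; the names are illustrative:)

```python
def yaml_lr_for(target_real_lr: float, gpu_num: int, batch_size: int) -> float:
    # Invert real_lr = yaml_lr * gpu_num * batch / 640 to get the yaml value
    # that produces a desired effective learning rate.
    return target_real_lr * 640 / (gpu_num * batch_size)

# Example (assumed setup): total batch 256, e.g. 8 GPUs x 32 per GPU, target 5e-5
print(yaml_lr_for(5e-5, gpu_num=8, batch_size=32))  # -> 0.000125
```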
I believe it meets expectations, as the COCO dataset contains a large number of images with multiple subjects and small faces, which are significant challenges in generation tasks. Training with higher-quality data is a good approach. Additionally, any reasonable learning rate will not cause a particularly large deviation in results; I recommend setting it to 5e-5 directly, without following the batch learning rate setting. I hope you achieve good results.
Thanks for your reply.