LTH14 / mar

PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838

The CFG strategy: linear vs. constant

yuhuUSTC opened this issue

Thanks for the great work!
I ran into a confusing issue while replicating this work. I retrained the model with this code and compared the two CFG strategies provided in the code, linear vs. constant. I find that constant CFG has a much worse FID but a much better sFID. The contradiction between FID and sFID is confusing. Besides, the IS and Recall also seem to contradict each other.
[screenshot: 256x256 evaluation results, linear vs. constant CFG]

Thanks for your interest! Constant CFG typically results in a very high Inception Score (around 500 IS for our final model) but poor FID -- that is also why it achieves very high precision. Linear CFG is used to improve the diversity of the generated images, so it improves the FID as well as the recall.
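(For readers landing here, the difference between the two schedules is simple to sketch. Below is an illustrative implementation, assuming a MAR-style sampler where `num_masked` tokens remain at the current step; the variable and function names are mine, not the repo's.)

def cfg_scale_at_step(cfg: float, num_masked: int, seq_len: int,
                      schedule: str = "linear") -> float:
    # cfg: target guidance scale (e.g. 3.0); num_masked: tokens still
    # masked at this step; seq_len: total number of tokens.
    if schedule == "constant":
        # Full guidance strength from the first token onward.
        return cfg
    if schedule == "linear":
        # Ramp from 1.0 (no guidance) up to `cfg` as more tokens are
        # generated: early tokens stay diverse, later tokens get sharper.
        frac_generated = (seq_len - num_masked) / seq_len
        return 1.0 + (cfg - 1.0) * frac_generated
    raise ValueError(f"unknown schedule: {schedule}")

# The per-step scale is then applied in the usual CFG combination:
#   pred = uncond + scale * (cond - uncond)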

Also, 200 epochs should result in an FID < 3 if you follow our default training setting.

Thanks for the reply!
The table above follows the default training setting, except that the number of sampling steps is 16.
My question about the table is that FID and sFID strongly contradict each other, which is very rare and confusing.

Another question: for 128x128 generation, the results are the opposite of the table above. At 128x128, constant CFG gives a much better FID but a much worse sFID than linear CFG.
[screenshot: 128x128 evaluation results, linear vs. constant CFG]
This result is also very confusing, and it contradicts the 256x256 result.

It would be good to sweep the CFG again whenever you change other configs, such as the resolution or the number of sampling steps. The optimal CFG typically changes when these configs change.

Also, even with 16 sampling steps, the FID seems too high -- as shown in Figure 6 of the paper, the FID with 16 sampling steps should be around 3.0 after 400 epochs of training.

About the FID value: I also think it is too high. To verify this, I took the provided official MAR-L checkpoint, set step=16, generated 50k images, and tested the FID. All settings exactly follow the official recommendation, without any modification. After generating the 50k images, I used the evaluation suite provided by guided_diffusion for the FID calculation. However, the FID is 6.23.
Do you have any suggestions for this abnormal result?
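(For anyone reproducing this: the guided_diffusion evaluation suite expects the 50k samples packed into a single .npz file holding a uint8 NHWC array. A minimal packing sketch, assuming the samples were saved as individual image files; the paths are placeholders:)

import os
import numpy as np
from PIL import Image

sample_dir = "samples/mar_large_step16"  # placeholder path
files = sorted(os.listdir(sample_dir))[:50000]
# Stack into a [50000, 256, 256, 3] uint8 array, the layout the
# evaluator reads from the npz's arr_0 entry.
imgs = np.stack([
    np.asarray(Image.open(os.path.join(sample_dir, f)).convert("RGB"))
    for f in files
]).astype(np.uint8)
np.savez("sample_batch.npz", arr_0=imgs)

# FID is then computed against the reference batch shipped with
# guided_diffusion, e.g.:
#   python evaluator.py VIRTUAL_imagenet256_labeled.npz sample_batch.npz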

Can you try step=64 and see if you can reproduce the result? This 6.23 FID is too high. step=64, cfg=3.0 should give you <2.0 FID.

I find that the number of inference steps significantly affects the FID.
[screenshot: FID at different numbers of inference steps]
With step=64, it achieves an FID of 2.14. This shows that the FID-vs-step curve differs from the paper's at small step counts.

Besides, constant CFG has a very high FID. I am curious whether this finding is consistent with yours.

First of all, your FID is still higher than our results; for instance, at 256 steps it should be 1.78 FID. Besides, for small generation step counts, the optimal CFG scale is no longer 3.0 and should typically be larger. In our experiments, we sweep the CFG and temperature to find the best FID for every generation step count.

Constant CFG, when using the same scale as linear CFG, typically has very high FID and also high IS (>10 FID and >450 IS). This is because it sacrifices diversity for better fidelity.

I would suggest first identifying why your 256-step result differs from ours, and then sweeping the generation parameters (cfg_scale and temperature) for the smaller generation step counts.
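(Such a sweep can be a simple grid search. The sketch below assumes a hypothetical helper generate_and_eval that runs the sampler with the given settings over 50k images and returns the FID; the grids are illustrative, not recommended values.)

import itertools

cfg_scales = [2.5, 3.0, 3.5, 4.0, 4.5]   # try larger scales for fewer steps
temperatures = [0.9, 0.95, 1.0, 1.05]

best_fid, best_cfg = float("inf"), None
for cfg, temp in itertools.product(cfg_scales, temperatures):
    # generate_and_eval is hypothetical: sample 50k images with these
    # settings and score them with the FID evaluator.
    fid = generate_and_eval(num_iter=16, cfg=cfg, temperature=temp)
    if fid < best_fid:
        best_fid, best_cfg = fid, (cfg, temp)
print(f"best FID {best_fid:.2f} at cfg={best_cfg[0]}, temperature={best_cfg[1]}")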

Thanks very much for your valuable suggestions. I did not realize that you sweep the CFG for different inference settings; I will conduct more experiments following your suggestions.

About the 1.78 vs. 1.98 FID: the 1.98 FID is achieved by exactly following the repository, without any modification. The inference setting is thus the same as the following:

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
main_mar.py \
--model mar_large --diffloss_d 8 --diffloss_w 1280 \
--eval_bsz 256 --num_images 50000 \
--num_iter 256 --num_sampling_steps 100 --cfg 3.0 --cfg_schedule linear --temperature 1.0 \
--output_dir pretrained_models/mar/mar_large \
--resume pretrained_models/mar/mar_large \
--data_path ${IMAGENET_PATH} --evaluate

Do you have any suggestions? Thanks again for your help!

I cannot diagnose your exact issue, but here are some reference points. I just ran an evaluation on 8 A6000 GPUs (using your command above) and here is the result:

[screenshot: evaluation result on 8x A6000]

A typical fluctuation in FID should be smaller than 0.05, and a typical fluctuation in IS should be smaller than 5. I have also validated our model's performance on L40s, H100, A100, and V100 GPUs.

Since we fix the random seed for generation, this result should be exactly reproducible if you use the same eval_bsz and number of GPUs.
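(To illustrate why the same eval_bsz and GPU count matter: a common distributed-evaluation pattern seeds each rank from a fixed base seed plus its rank, so which noise stream produces which image depends on how the 50k samples are split across ranks and batches. A sketch of the pattern, not necessarily the repo's exact code:)

import torch
import torch.distributed as dist

base_seed = 0
rank = dist.get_rank() if dist.is_initialized() else 0
torch.manual_seed(base_seed + rank)

# Changing the world size or eval_bsz reshuffles which (rank, batch)
# slot a given sample falls into, so the per-image noise changes even
# though the base seed is fixed -- and FID/IS then fluctuate slightly.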