lucidrains / lion-pytorch

šŸ¦ Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch

Strange Results on first step

nbardy opened this issue · comments

I'm fine-tuning SD 1.5 at a high effective batch size of 576 (batch size 24 with 24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).

Trying to get Lion working, but I'm getting very strange results on the first step. It seems to reset the model to a weird, texture-filled state.

Step 0 on the left and Step 1 on the right

[image]

Here is one I let run longer. It actually seems to be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500
[image]

Relevant Code:

    from lion_pytorch import Lion

    optimizer = Lion(
        params_to_optimize,
        lr=args.learning_rate,
        weight_decay=1e-2,
        betas=(0.95, 0.98),
        use_triton=True,  # set this to True to use the CUDA kernel w/ Triton lang (Tillet et al)
    )

Relevant parameters:

--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0  --max_train_steps=10000 --mixed_precision bf16
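
For context, here is a minimal self-contained sketch of how these two flags combine (the toy model and tensor shapes below are hypothetical, not the actual SD fine-tuning script): 24 micro-batches of 24 samples are accumulated before each Lion step, so the effective batch size is 24 × 24 = 576.

    import torch
    from torch import nn
    from lion_pytorch import Lion

    # hypothetical toy setup, not the actual fine-tuning script
    batch_size, grad_accum = 24, 24  # effective batch = 24 * 24 = 576
    model = nn.Linear(16, 1)
    optimizer = Lion(model.parameters(), lr=7.0e-7, weight_decay=1e-2, betas=(0.95, 0.98))

    optimizer.zero_grad()
    for micro_step in range(grad_accum):
        x = torch.randn(batch_size, 16)
        loss = model(x).pow(2).mean() / grad_accum  # scale so accumulated grads average
        loss.backward()
    optimizer.step()  # one Lion update covering all 576 samples
    optimizer.zero_grad()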

No apparent sharp decrease in loss?
[image]

More samples across many prompts:

[image]

Note: the black squares are just the NSFW filter, I believe.

Hi, thanks for the datapoint.

Do you have a comparison of the commands used for running with Lion and AdamW?

@xiangning-chen same command besides the lr_opt value

Oh I meant the learning_rate, lr_end, and weight decay comparison for Lion and AdamW.

They are in the main post.

'Relevant Code' is for Lion
'Relevant parameters' is for Adam

@nbardy Sorry I'm a bit confused, in Relevant parameters you set the --lion_opt flag, but this is for Adam?
Can you please just tell me the learning_rate, lr_end, and weight decay for Lion and AdamW respectively, thanks!

@nbardy do you get this behaviour without triton?

One thing I noticed here is that the Triton code uses auto-tune plus in-place updates, which may cause issues. On the first step, multiple different kernels are launched that all do the same thing, to see which is fastest; this is unique to the first step. Usually this is not a problem when training from scratch, since warmup is used, but it may be here.
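
If that's what's happening, the effect is easy to reproduce in plain PyTorch. Here is a minimal sketch (hypothetical; this is not the actual lion-pytorch Triton kernel or its autotuner): when the benchmarking pass runs a kernel that mutates the parameter in place once per candidate config, the very first optimizer step gets applied several times over.

    import torch

    # hypothetical stand-in for a fused kernel that updates the parameter in place
    def inplace_update_kernel(p, grad, lr):
        p.add_(grad.sign(), alpha=-lr)

    # naive "autotune": benchmark every candidate config on the real tensors,
    # so the in-place update lands once per config on the very first call
    def first_step_with_naive_autotune(p, grad, lr, configs):
        for _ in configs:
            inplace_update_kernel(p, grad, lr)

    # workaround: benchmark on throwaway clones, then apply the real update once
    def first_step_with_safe_autotune(p, grad, lr, configs):
        for _ in configs:
            inplace_update_kernel(p.clone(), grad, lr)
        inplace_update_kernel(p, grad, lr)

    grad = torch.ones(4)
    configs = ["cfg_a", "cfg_b", "cfg_c"]
    p_naive, p_safe = torch.zeros(4), torch.zeros(4)

    first_step_with_naive_autotune(p_naive, grad, lr=1e-3, configs=configs)
    first_step_with_safe_autotune(p_safe, grad, lr=1e-3, configs=configs)
    print(p_naive)  # 3x the intended first step
    print(p_safe)   # the intended single step

A quick way to check whether the autotune pass is the culprit here would be to construct the optimizer with use_triton=False and see whether the first-step jump disappears.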

@mitchellnw thanks for bringing this to my attention Mitchell!

@nbardy do you want to see if 6ab873a addresses the issue?

I have finally got back to training more diffusion models.

Tried upgrading to lion-pytorch==0.1.2, and it still seems to reset on the first step.

https://wandb.ai/nbardy-facet/sd_xl_train_t2iadapter/runs/eey3bj1n?workspace=user-nbardy-facet

lion-pytorch==0.1.2
pytorch-triton==2.1.0+e650d3708b
triton==2.0.0
torch==2.0.1

Turned off Lion and it's still there. This is probably something else from my changes. Will test more next week.

Confirmed fixed.