lucidrains / lion-pytorch

šŸ¦ Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch

Strange Results on first step

nbardy opened this issue · comments

I'm fine-tuning SD 1.5 at a high effective batch size of 576 (batch size 24 with 24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).

Trying to get Lion working, but I'm getting very strange results on the first step. It seems to reset the model to a weird, texture-filled state.

Step 0 on the left and Step 1 on the right

[image]

Here is one I let run longer. It actually seems to be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500
[image]

Relevant Code:

    from lion_pytorch import Lion

    optimizer = Lion(
        params_to_optimize,
        lr=args.learning_rate,
        weight_decay=1e-2,
        betas=(0.95, 0.98),
        use_triton=True,  # set this to True to use the CUDA kernel w/ Triton lang (Tillet et al)
    )

Relevant parameters:

--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0  --max_train_steps=10000 --mixed_precision bf16
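
For context, here is a minimal self-contained sketch of how these two flags combine (the toy model and tensor shapes below are hypothetical, not the actual SD fine-tuning script): 24 micro-batches of 24 samples are accumulated before each Lion step, so the effective batch size is 24 × 24 = 576.

    import torch
    from torch import nn
    from lion_pytorch import Lion

    # hypothetical toy setup, not the actual fine-tuning script
    batch_size, grad_accum = 24, 24  # effective batch = 24 * 24 = 576
    model = nn.Linear(16, 1)
    optimizer = Lion(model.parameters(), lr=7.0e-7, weight_decay=1e-2, betas=(0.95, 0.98))

    optimizer.zero_grad()
    for micro_step in range(grad_accum):
        x = torch.randn(batch_size, 16)
        loss = model(x).pow(2).mean() / grad_accum  # scale so accumulated grads average
        loss.backward()
    optimizer.step()  # one Lion update covering all 576 samples
    optimizer.zero_grad()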

No apparent sharp decrease in loss?
[image]

More samples across many prompts:

[image]

Note: the black squares are just the NSFW filter, I believe.

Hi, thanks for the datapoint.

Do you have a comparison of the commands used for running with Lion and AdamW?

@xiangning-chen same command besides the lr_opt value

Oh I meant the learning_rate, lr_end, and weight decay comparison for Lion and AdamW.

They are in the main post.

'Relevant Code' is for Lion
'Relevant parameters' is for Adam

@nbardy Sorry I'm a bit confused, in Relevant parameters you set the --lion_opt flag, but this is for Adam?
Can you please just tell me the learning_rate, lr_end, and weight decay for Lion and AdamW respectively, thanks!

@nbardy do you get this behaviour without triton?

One thing I noticed here is that the Triton code uses auto-tune plus in-place updates, which may cause issues. On the first step, multiple different kernels are launched that all do the same thing, to see which is fastest; this is unique to the first step. Usually this is not a problem when training from scratch, since warmup is used, but it may be here.
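
If that's what's happening, the effect is easy to reproduce in plain PyTorch. Here is a minimal sketch (hypothetical; this is not the actual lion-pytorch Triton kernel or its autotuner): when the benchmarking pass runs a kernel that mutates the parameter in place once per candidate config, the very first optimizer step gets applied several times over.

    import torch

    # hypothetical stand-in for a fused kernel that updates the parameter in place
    def inplace_update_kernel(p, grad, lr):
        p.add_(grad.sign(), alpha=-lr)

    # naive "autotune": benchmark every candidate config on the real tensors,
    # so the in-place update lands once per config on the very first call
    def first_step_with_naive_autotune(p, grad, lr, configs):
        for _ in configs:
            inplace_update_kernel(p, grad, lr)

    # workaround: benchmark on throwaway clones, then apply the real update once
    def first_step_with_safe_autotune(p, grad, lr, configs):
        for _ in configs:
            inplace_update_kernel(p.clone(), grad, lr)
        inplace_update_kernel(p, grad, lr)

    grad = torch.ones(4)
    configs = ["cfg_a", "cfg_b", "cfg_c"]
    p_naive, p_safe = torch.zeros(4), torch.zeros(4)

    first_step_with_naive_autotune(p_naive, grad, lr=1e-3, configs=configs)
    first_step_with_safe_autotune(p_safe, grad, lr=1e-3, configs=configs)
    print(p_naive)  # 3x the intended first step
    print(p_safe)   # the intended single step

A quick way to check whether the autotune pass is the culprit here would be to construct the optimizer with use_triton=False and see whether the first-step jump disappears.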

@mitchellnw thanks for bringing this to my attention Mitchell!

@nbardy do you want to see if 6ab873a addresses the issue?

I have finally got back to training more diffusion models.

Tried upgrading to lion-pytorch==0.1.2, and it still seems to reset on the first step.

https://wandb.ai/nbardy-facet/sd_xl_train_t2iadapter/runs/eey3bj1n?workspace=user-nbardy-facet

lion-pytorch==0.1.2
pytorch-triton==2.1.0+e650d3708b
triton==2.0.0
torch==2.0.1

Turned off Lion and it's still there. This is probably something else from my changes. Will test more next week.

Confirmed fixed.