cloneofsimo / minRF

Minimal implementation of scalable rectified flow transformers, based on SD3's approach


Unused parameters

zaptrem opened this issue

```python
self.normC2 = Fp32LayerNorm(dim, bias=False)
self.w1o = nn.Linear(dim, dim, bias=False)
```

These are not used in the last layer and should be moved inside an `if not last` statement. Unused parameters make some distributed training algorithms slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
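Something like this minimal sketch of the fix, assuming the block receives an `is_last` flag at construction (the flag name is hypothetical, and `Fp32LayerNorm` is stubbed here as a LayerNorm computed in fp32):

```python
import torch.nn as nn

class Fp32LayerNorm(nn.LayerNorm):
    # Stand-in for the repo's Fp32LayerNorm (bias=False needs torch>=2.1).
    def forward(self, x):
        return super().forward(x.float()).type_as(x)

class Block(nn.Module):
    def __init__(self, dim, is_last=False):  # `is_last` is a hypothetical flag
        super().__init__()
        if not is_last:
            # Only register these parameters when the layer will actually use
            # them, so DDP never waits on gradients that are never produced.
            self.normC2 = Fp32LayerNorm(dim, bias=False)
            self.w1o = nn.Linear(dim, dim, bias=False)
```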

Edit: Also, (unless I misread your code) you seem to put only the timestep embedding into the AdaLN scale/shift, but the SD3 paper also feeds in a pooled vector derived from the image caption. Did you find the former worked better?
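For concreteness, a minimal sketch of the SD3-style AdaLN conditioning being described, not minRF's actual code (class and argument names are hypothetical):

```python
import torch.nn as nn

class AdaLNModulation(nn.Module):
    # Maps a conditioning vector to per-channel scale and shift.
    def __init__(self, dim):
        super().__init__()
        self.mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x, t_emb, pooled_text=None):
        # minRF (per this thread): cond = t_emb only.
        # SD3: cond = t_emb + a pooled caption embedding.
        cond = t_emb if pooled_text is None else t_emb + pooled_text
        scale, shift = self.mod(cond).chunk(2, dim=-1)
        # x: (B, tokens, dim); t_emb and pooled_text: (B, dim)
        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```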

Edit 2: Also also, did your muP optimization lead that far from a 1e-4 learning rate? Can you share the results of your hparam search?

Ah yes, you are correct.

> Edit: Also, (unless I misread your code) you seem to put only the timestep embedding into the AdaLN scale/shift, but the SD3 paper also feeds in a pooled vector derived from the image caption. Did you find the former worked better?

I just don't find CLIP embeddings useful when I run inference with them. Kinda my personal thing.
Because muP divides the global learning rate by the input dimension, it's actually more like 1e-4 in practice for the fat layers.
For biases or the input, it's much larger, which is the rationale behind muP.
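A minimal sketch of that scaling rule, assuming the common muP convention for Adam of dividing the learning rate of matrix-like weights by their fan-in while 1-D parameters keep the full rate (this grouping is a simplification, not minRF's exact implementation):

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float):
    # Simplified muP-style grouping: matrix-like weights get base_lr / fan_in,
    # while biases (and other 1-D params) keep the full base_lr.
    groups = []
    for p in model.parameters():
        if p.ndim >= 2:
            groups.append({"params": [p], "lr": base_lr / p.shape[1]})
        else:
            groups.append({"params": [p], "lr": base_lr})
    return groups

# A global lr of 0.1 becomes 0.1 / 1024 ~= 1e-4 for a dim-1024 weight matrix,
# which is the "more like 1e-4 for fat layers" point above.
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 10))
opt = torch.optim.Adam(mup_param_groups(model, base_lr=0.1))
```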