Support for NormSoftmax
catid opened this issue
Based on this paper: https://openreview.net/pdf?id=4g7nCbpjNwd
Would require editing this line:
And replacing the * scale with:
```python
if self.norm_softmax:
    dots = dots / torch.clamp(dots.std(dim=-1, keepdim=True), min=1e-6)
else:
    dots *= scale
```
And then something similar in the other flash attention path
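For context, here is a minimal sketch of how that toggle could sit inside a plain (non-flash) attention forward pass. The `norm_softmax` flag and the 1e-6 floor come from the snippet above; the surrounding module is only illustrative and not the library's exact code:

```python
import torch
from torch import nn
from einops import rearrange

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, norm_softmax=False):
        super().__init__()
        inner_dim = dim_head * heads
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.norm_softmax = norm_softmax

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, dim)

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2))

        if self.norm_softmax:
            # normalize each query's row of logits by its standard deviation
            dots = dots / torch.clamp(dots.std(dim=-1, keepdim=True), min=1e-6)
        else:
            # the usual 1/sqrt(dim_head) scaling
            dots = dots * self.scale

        attn = dots.softmax(dim=-1)
        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
```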
@catid oh interesting, reminds me a bit of https://arxiv.org/abs/2005.09561
there will also be a temperature involved
have you tried this? maybe i can run a quick experiment tonight
it won't be compatible with flash attention
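As a rough sketch of where a temperature could enter, assuming it acts as a learnable per-head scale applied to the normalized logits (the exact placement would need to be checked against the paper; the helper name here is made up):

```python
import torch
from torch import nn

# hypothetical helper: divide each row of logits by its standard deviation,
# then rescale by a learnable per-head temperature of shape (heads, 1, 1)
def normed_logits(dots, temperature, eps=1e-6):
    dots = dots / torch.clamp(dots.std(dim=-1, keepdim=True), min=eps)
    return dots * temperature

# usage inside an attention module (names are illustrative):
#   self.temperature = nn.Parameter(torch.ones(heads, 1, 1))
#   dots = normed_logits(dots, self.temperature)
```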
NormSoftmax CIFAR-10 benchmark results at epoch=60 using ViT-tiny:
baseline: 77.69%
sqrtd: 76.39%
inf: 77.53%
NormSoftmax CIFAR-10 benchmark results at epoch=300 using ViT-tiny:
baseline: 85.19%
inf: 85.07%
Manages to get about the same result without the extra parameters
another engineering obstacle would be handling a masked standard dev
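A sketch of one way that could be handled (not from the thread): compute the mean and variance over only the unmasked keys, with `mask` True at positions that may be attended to, and keep the usual additive masking before the softmax.

```python
import torch

def masked_std(dots, mask, eps=1e-6):
    # dots: (batch, heads, q_len, k_len) attention logits
    # mask: boolean tensor broadcastable to dots, True where a key is visible
    mask = mask.to(dots.dtype)
    count = mask.sum(dim=-1, keepdim=True).clamp(min=1.)
    mean = (dots * mask).sum(dim=-1, keepdim=True) / count
    # biased (population) variance over visible keys only,
    # unlike torch.std's default unbiased estimator
    var = ((dots - mean) ** 2 * mask).sum(dim=-1, keepdim=True) / count
    return var.clamp(min=eps).sqrt()

# e.g. with a causal mask: dots = dots / masked_std(dots, causal_mask)
# masked positions still get filled with a large negative value before softmax
```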
yea, let me run it tonight on enwik8, but if i don't see anything notable on the first or second try, probably will just drop this
@lucidrains The masked stddev is like this right? https://github.com/catid/cifar10deepspeed/blob/fe5b399c5ab5f3ed11235d3dbe72952ce7c2be46/models/vit_small.py#L75
I think that's what I'm testing
@catid i'm thinking of autoregressive text generation (gpt), with the triangular causal mask. are you masking out the diagonal?
Yeah I'm just copying your vit_for_small_dataset.py
@catid ohh ok, do you see anything? have you ran the experiments yourself? never trust anything a paper says unless you see the curves in front of you 😆
The results I shared above are from my setup
@catid wow! ok, i actually put a lot of weight on results from internet randos
ok, let me try it tonight!
@catid wait, your results show norm softmax to be worse than baseline? is that accuracy?
@catid can you share a wandb report with training curves?
I dunno, the numbers are pretty close, and I only ran an N=1 trial, so I'm not sure if one method produces better accuracy than the other. Also, I don't have wandb integrated into my scripts yet (haven't learned how to use it yet).
ah, looks to be a negative result.