Question about the architecture (graphTransformer)

Forbu opened this issue 5 months ago · comments

I was looking at your implementation of attention here:
https://github.com/cvignac/DiGress/blob/main/src/models/transformer_model.py#L158

I have some questions about the code:

Q = Q.unsqueeze(2)  # (bs, n, 1, n_head, df)
K = K.unsqueeze(1)  # (bs, 1, n, n_head, df)

# Compute unnormalized attentions. Y is (bs, n, n, n_head, df)
Y = Q * K

Here I have a question: in the classic attention mechanism, Y has dimension (bs, n, n, n_head), so the scores are not feature-specific. I don't know if this is what the authors wanted (this is not a proper outer product, it is an element-wise multiplication).
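To make the shape difference concrete, here is a minimal sketch comparing the element-wise product from the quoted snippet with unscaled dot-product scores. The tensor sizes are illustrative assumptions, not values from DiGress, and the `einsum` line is only one common way to write the standard scores.

```python
import torch

torch.manual_seed(0)
bs, n, n_head, df = 2, 4, 3, 5  # illustrative sizes, not taken from DiGress

Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# Element-wise product, as in the quoted snippet: one score per feature.
Y_vector = Q.unsqueeze(2) * K.unsqueeze(1)        # (bs, n, n, n_head, df)

# Unscaled dot-product scores: the feature dimension is summed out.
Y_scalar = torch.einsum("bihd,bjhd->bijh", Q, K)  # (bs, n, n, n_head)

# Summing the element-wise product over df recovers the dot-product scores.
assert torch.allclose(Y_vector.sum(dim=-1), Y_scalar, atol=1e-5)
print(Y_vector.shape, Y_scalar.shape)
```

In other words, the quoted code keeps the per-feature terms that standard attention would sum before the softmax.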

Also, a few lines later we have:

attn = masked_softmax(Y, softmax_mask, dim=2)  # bs, n, n, n_head
print("attn.shape : ", attn.shape)  # I added this

The attention shape I obtain is (bs, n, n, n_head, df), contrary to the comment.
So the code does not really implement "real" graph transformer attention the way other implementations do, for example:
https://docs.dgl.ai/_modules/dgl/nn/pytorch/gt/egt.html#EGTLayer
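The shape reported above can be reproduced with a small sketch. `masked_softmax_sketch` is a hypothetical stand-in written for this example, not the repository's `masked_softmax`, and the mask layout is an assumption. The point is only that a softmax over `dim=2` preserves the input shape, so 5D scores give 5D attention weights.

```python
import torch

def masked_softmax_sketch(x, mask, dim):
    """Hypothetical stand-in: softmax along `dim` that ignores masked-out entries."""
    x = x.masked_fill(~mask, float("-inf"))
    return torch.softmax(x, dim=dim)

torch.manual_seed(0)
bs, n, n_head, df = 2, 4, 3, 5  # illustrative sizes
Y = torch.randn(bs, n, n, n_head, df)

node_mask = torch.ones(bs, n, dtype=torch.bool)
node_mask[0, -1] = False  # pretend the last node of graph 0 is padding

# Broadcast the key mask over the query, head and feature axes: (bs, 1, n, 1, 1).
softmax_mask = node_mask[:, None, :, None, None]

attn = masked_softmax_sketch(Y, softmax_mask, dim=2)
print(attn.shape)  # same shape as Y: one weight per (query, key, head, feature)
```

Each feature channel is normalized independently over the key axis, which is why the trailing `df` dimension survives.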

But since your code gives me better results than the one above (which uses a proper attention mechanism), I wonder whether this is something the authors did intentionally.

Clement Vignac · Answer 1 · Wed Jan 31 2024 00:58:18 GMT+0800 (China Standard Time)

Copying the answer from #47: your observation is correct. It's not exactly the standard attention mechanism. I've not thoroughly compared the two, but the current code was written on purpose. The reason is that we have to manipulate features of size (bs, n, n, de) anyway, so using vector attention scores instead of scalar ones does not create a strong memory bottleneck. It would be interesting to investigate this further, though.
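As a rough illustration of the memory argument, the sketch below counts tensor elements for the three shapes mentioned in this thread. All sizes are made-up assumptions chosen only to show the scaling, not DiGress hyperparameters.

```python
def numel(*shape):
    """Number of elements in a tensor of the given shape."""
    total = 1
    for s in shape:
        total *= s
    return total

# Illustrative sizes only (assumptions, not taken from the repository).
bs, n, n_head, df, de = 32, 20, 8, 32, 64

edge_feats = numel(bs, n, n, de)            # edge features, needed in any case
vector_attn = numel(bs, n, n, n_head, df)   # per-feature attention scores
scalar_attn = numel(bs, n, n, n_head)       # standard per-head attention scores

print("vector scores / edge features:", vector_attn / edge_feats)
print("vector scores / scalar scores:", vector_attn / scalar_attn)
```

Both the edge features and the vector scores grow as n squared times a feature width, so the vector scores differ from the edge features only by the constant factor (n_head * df) / de.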

Adrien B · Answer 2 · Wed Jan 31 2024 03:53:31 GMT+0800 (China Standard Time)

I am running some experiments on my own graph dataset. Your implementation seems to perform better than the standard graph transformer (at least the one I tried from the DGL library). Yours clearly manages to generate more plausible edges.
I am running more experiments to confirm this (I currently only have "visual" clues and noisy loss curves to back this claim).

Your implementation is equivalent to a classic graph transformer with as many heads as the original feature dimension, so each head ends up with only one dimension (I mean that with df = 1 you would obtain the same results).
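This equivalence can be checked numerically on the attention weights alone. The sketch below is an assumption-laden simplification: the helper names are hypothetical, and any scaling, masking, or edge-feature terms from the actual model are left out.

```python
import torch

def vector_scores(Q, K, n_head):
    """Element-wise scores, one per feature: (bs, n, n, n_head, df)."""
    bs, n, dx = Q.shape
    Qh = Q.reshape(bs, n, n_head, dx // n_head)
    Kh = K.reshape(bs, n, n_head, dx // n_head)
    return Qh.unsqueeze(2) * Kh.unsqueeze(1)

def scalar_scores(Q, K, n_head):
    """Unscaled dot-product scores, one per head: (bs, n, n, n_head)."""
    bs, n, dx = Q.shape
    Qh = Q.reshape(bs, n, n_head, dx // n_head)
    Kh = K.reshape(bs, n, n_head, dx // n_head)
    return torch.einsum("bihd,bjhd->bijh", Qh, Kh)

torch.manual_seed(0)
bs, n, n_head, df = 2, 4, 3, 5  # illustrative sizes
dx = n_head * df
Q = torch.randn(bs, n, dx)
K = torch.randn(bs, n, dx)

# Vector attention with n_head heads of size df, flattened to (bs, n, n, dx).
vec_attn = torch.softmax(vector_scores(Q, K, n_head), dim=2).flatten(start_dim=3)

# Scalar attention with dx heads of size 1: also (bs, n, n, dx).
one_dim_attn = torch.softmax(scalar_scores(Q, K, dx), dim=2)

assert torch.allclose(vec_attn, one_dim_attn, atol=1e-6)
```

Under these simplifications the two sets of attention weights match entry for entry, which is the sense in which df = 1 gives the same result.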