Question about the architecture (graphTransformer)
Forbu opened this issue
I was looking at your implementation of attention here:
https://github.com/cvignac/DiGress/blob/main/src/models/transformer_model.py#L158
I have some questions about the code:
Q = Q.unsqueeze(2) # (bs, 1, n, n_head, df)
K = K.unsqueeze(1) # (bs, n, 1, n head, df)
# Compute unnormalized attentions. Y is (bs, n, n, n_head, df)
Y = Q * K
My first question: in the classic attention mechanism, Y has shape (bs, n, n, n_head), with no per-feature dimension. Here the df axis is kept, because Q * K is an element-wise multiplication rather than the usual dot product summed over df. I don't know whether this is what the authors intended.
Also, a few lines later we have:
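To make the shape difference concrete, here is a minimal NumPy sketch. It is not the repository's PyTorch code: the tensor sizes and variable names are made up for illustration, and Q and K are assumed to start with shape (bs, n, n_head, df).

```python
import numpy as np

rng = np.random.default_rng(0)
bs, n, n_head, df = 2, 5, 4, 3  # illustrative sizes only

Q = rng.standard_normal((bs, n, n_head, df))
K = rng.standard_normal((bs, n, n_head, df))

# Element-wise product with broadcasting, as in the quoted snippet:
# the df axis is kept.
Q_exp = np.expand_dims(Q, 2)   # (bs, n, 1, n_head, df)
K_exp = np.expand_dims(K, 1)   # (bs, 1, n, n_head, df)
Y_elementwise = Q_exp * K_exp  # (bs, n, n, n_head, df)

# Classic dot-product scores: sum over df, one scalar per (i, j, head).
Y_dot = Y_elementwise.sum(axis=-1)  # (bs, n, n, n_head)

print(Y_elementwise.shape)  # (2, 5, 5, 4, 3)
print(Y_dot.shape)          # (2, 5, 5, 4)
```

The only difference between the two is the final sum over the df axis.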
attn = masked_softmax(Y, softmax_mask, dim=2) # bs, n, n, n_head
print("attn.shape : ", attn.shape) # I added this
The attention shape I obtain is (bs, n, n, n_head, df), contrary to what the comment says.
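This is what one would expect from a softmax applied along dim=2 of a 5-dimensional tensor: the shape is unchanged, and each feature channel gets its own distribution over neighbours. A small NumPy sketch, using a plain unmasked `softmax` helper as a stand-in (an assumption; the repository's `masked_softmax` may behave differently):

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis (no masking, for brevity).
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
bs, n, n_head, df = 2, 5, 4, 3  # illustrative sizes only
Y = rng.standard_normal((bs, n, n, n_head, df))

attn = softmax(Y, axis=2)
print(attn.shape)  # (2, 5, 5, 4, 3)

# Every (batch, query node, head, feature) slice sums to 1 over axis 2.
sums = attn.sum(axis=2)  # (bs, n, n_head, df)
print(np.allclose(sums, 1.0))  # True
```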
So the code does not really implement "real" graph transformer attention, unlike other implementations such as:
https://docs.dgl.ai/_modules/dgl/nn/pytorch/gt/egt.html#EGTLayer
But since your code gives me better results than the one above (which uses a proper attention mechanism), I wonder whether the authors did this intentionally.
I am running experiments on my own graph dataset. Your implementation seems to perform better than the standard graph transformer (at least the one I tried from the DGL library). Yours clearly generates more plausible edges.
I am running more experiments to confirm this (I currently only have "visual" clues and noisy loss curves to back this claim).
Your implementation is equivalent to a classic graph transformer with as many heads as there are feature dimensions, so each head ends up with a single dimension (in other words, a standard transformer with df = 1 gives the same result).
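Here is a minimal NumPy sketch that checks this equivalence numerically. It makes several simplifying assumptions: no edge features, no masking, no score scaling, and function names (`featurewise_attention`, `dot_product_attention`, `split`) invented for this example. It is not the repository's code.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def featurewise_attention(Q, K, V):
    # Q, K, V: (bs, n, n_head, df). Scores keep the df axis.
    Y = Q[:, :, None] * K[:, None]          # (bs, n, n, n_head, df)
    attn = softmax(Y, axis=2)               # normalize over key nodes
    return (attn * V[:, None]).sum(axis=2)  # (bs, n, n_head, df)

def dot_product_attention(Q, K, V):
    # Classic multi-head attention: one scalar score per (i, j, head).
    scores = np.einsum("bihd,bjhd->bijh", Q, K)   # (bs, n, n, n_head)
    attn = softmax(scores, axis=2)
    return np.einsum("bijh,bjhd->bihd", attn, V)  # (bs, n, n_head, df)

rng = np.random.default_rng(0)
bs, n, n_head, df = 2, 5, 4, 3  # illustrative sizes only
Q, K, V = (rng.standard_normal((bs, n, n_head, df)) for _ in range(3))

out_featurewise = featurewise_attention(Q, K, V)

def split(x):
    # Treat every feature as its own head of dimension 1.
    return x.reshape(bs, n, n_head * df, 1)

out_classic = dot_product_attention(split(Q), split(K), split(V))
out_classic = out_classic.reshape(bs, n, n_head, df)

print(np.allclose(out_featurewise, out_classic))  # True
```

If the real code scales the scores (for example by a factor depending on df), the two versions would differ only by a constant softmax temperature, which this sketch leaves out.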