cvignac / DiGress

code for the paper "DiGress: Discrete Denoising diffusion for graph generation"


Question about the architecture (graphTransformer)

Forbu opened this issue · comments

I was looking at your implementation of attention here:
https://github.com/cvignac/DiGress/blob/main/src/models/transformer_model.py#L158

I have some questions about the code:

Q = Q.unsqueeze(2)  # (bs, 1, n, n_head, df)
K = K.unsqueeze(1)  # (bs, n, 1, n_head, df)

# Compute unnormalized attentions. Y is (bs, n, n, n_head, df)
Y = Q * K

Here I have a question: in the classic attention mechanism, Y has shape (bs, n, n, n_head), i.e. one scalar score per pair of nodes and per head, not one per feature dimension. I don't know whether this is what the authors wanted (this is not a proper dot product, it is an element-wise multiplication).
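To illustrate what I mean, here is a small standalone sketch (the names bs, n, n_head, df just follow the comments in the snippet above; this is not the repo code):

```python
import torch

bs, n, n_head, df = 2, 5, 4, 8
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# Classic dot-product attention: contract over the feature axis,
# one scalar score per (query node, key node, head).
scores_dot = torch.einsum('bihd,bjhd->bijh', Q, K)        # (bs, n, n, n_head)

# DiGress-style product: broadcasted element-wise multiplication,
# which keeps a separate score per feature dimension.
scores_elem = Q.unsqueeze(2) * K.unsqueeze(1)             # (bs, n, n, n_head, df)

print(scores_dot.shape)   # torch.Size([2, 5, 5, 4])
print(scores_elem.shape)  # torch.Size([2, 5, 5, 4, 8])

# Summing the element-wise scores over df recovers the dot-product scores.
assert torch.allclose(scores_elem.sum(dim=-1), scores_dot, atol=1e-5)
```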

Also, a few lines later we have:

attn = masked_softmax(Y, softmax_mask, dim=2)  # bs, n, n, n_head
print("attn.shape: ", attn.shape)  # I added this line

The attention tensor I obtain actually has shape (bs, n, n, n_head, df), contrary to the comment (see the shape check below).
So the code is not really implementing "real" graph transformer attention the way other implementations do, for example:
https://docs.dgl.ai/_modules/dgl/nn/pytorch/gt/egt.html#EGTLayer
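Here is the shape check I am referring to (I use a plain torch.softmax instead of the repo's masked_softmax, which if I read it correctly only masks the logits before the softmax):

```python
import torch

bs, n, n_head, df = 2, 5, 4, 8
Y = torch.randn(bs, n, n, n_head, df)

# Softmax over dim=2 normalizes over the key axis only; every other axis,
# including the feature axis df, is kept.
attn = torch.softmax(Y, dim=2)
print(attn.shape)  # torch.Size([2, 5, 5, 4, 8]) -> (bs, n, n, n_head, df), not (bs, n, n, n_head)
```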

But since your code gives me better results than the one above (which uses a "proper" attention mechanism), I wonder whether this is something the authors did intentionally.

I am doing some experiments on my own graph dataset. Your implementation seems to perform better than the standard graph transformer (at least the one I tried from the DGL library); it clearly manages to generate more plausible edges.
I am running more experiments to confirm this (for now I only have "visual" clues and noisy loss curves to back this claim).

Your implementation is equivalent to a classic graph transformer with as many heads as there are feature dimensions, so you end up with heads of dimension one (i.e. with df = 1 you would obtain the same results).
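A quick way to check this claim (again just a sketch with random tensors, not the repo code): reshaping the n_head heads of dimension df into n_head * df heads of dimension 1 gives exactly the same unnormalized scores as the element-wise product.

```python
import torch

bs, n, n_head, df = 2, 5, 4, 8
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# Element-wise scores as in transformer_model.py: (bs, n, n, n_head, df)
Y_elem = Q.unsqueeze(2) * K.unsqueeze(1)

# The same tensors viewed as n_head * df heads of dimension 1: a dot product
# over a size-1 feature axis is just a scalar product, so the "standard"
# per-head scores coincide with the element-wise ones.
Q1 = Q.reshape(bs, n, n_head * df, 1)
K1 = K.reshape(bs, n, n_head * df, 1)
Y_one_dim = torch.einsum('bihd,bjhd->bijh', Q1, K1)       # (bs, n, n, n_head * df)

assert torch.allclose(Y_elem.reshape(bs, n, n, n_head * df), Y_one_dim)
```

And since the softmax is applied over dim=2 independently for every (head, feature) channel, the normalized attention weights should match as well.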