qitianwu / DIFFormer

The official implementation of the ICLR 2023 spotlight paper "DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion"

Batch computation of Difformer

tongnie opened this issue

Hi! Thanks for presenting such interesting work and sharing your code! I'm very interested in DIFFormer and I'd like to conduct spatial-temporal prediction tasks based on it. In practice, however, the input for spatial-temporal datasets typically has shape [batch_size, sequence_length, number_of_nodes, feature_dimension], where each mini-batch is split along the sequence dimension rather than along the node dimension as in typical GNN problems. Is DIFFormer applicable in this case, and what adjustments to your implementation might be needed?

Looking forward to your reply and I appreciate it greatly!

Hi Tong, our current experiments for the spatial-temporal case only use the previous graph snapshot as input for predicting the next state, and we feed one graph snapshot at a time into the model's forward pass. So, in our model implementation (difformer.py), the input data has shape [number_of_nodes, feature_dimension], the same as in the other two tasks.
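
For concreteness, here is a minimal sketch of this snapshot-by-snapshot scheme; the model below is a placeholder linear layer, not the actual DIFFormer class, whose real forward pass also takes the graph structure:

```python
import torch

# Placeholder for the DIFFormer model; the real class in difformer.py
# also consumes the graph structure, which is omitted here.
model = torch.nn.Linear(16, 16)

# Sequence of graph snapshots, each of shape [number_of_nodes, feature_dimension].
snapshots = [torch.randn(207, 16) for _ in range(12)]

for t in range(len(snapshots) - 1):
    x_t = snapshots[t]         # previous snapshot: one [N, D] forward pass
    target = snapshots[t + 1]  # next state to predict
    pred = model(x_t)          # placeholder forward; the real call also needs edges
    loss = torch.nn.functional.mse_loss(pred, target)
```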

For the case you mentioned, where the input needs to be [batch_size, sequence_length, number_of_nodes, feature_dimension], I think the simplest approach is to add the two extra dimensions to the input x of the DIFFormer class and treat them independently, so that the all-pair attention is still applied along the node dimension. Moreover, you could also apply the all-pair attention along the sequence dimension to capture temporal dependence if needed. The full_attention_conv function in difformer.py is flexible for arbitrary query/key/value inputs, depending on the dimension you target for computing the attention.
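
As a rough illustration (my own sketch, not the repository's full_attention_conv, which uses a linear-complexity attention kernel), here is a plain softmax attention applied along the node dimension while the batch and sequence dimensions are treated independently, together with the index change that would instead attend along the sequence dimension:

```python
import torch

def all_pair_attention_over_nodes(qs, ks, vs):
    """All-pair attention along the node axis, independently per (batch, step).

    qs, ks, vs: [batch_size, seq_len, num_nodes, dim]
    Note: this is a quadratic softmax attention for clarity; the actual
    full_attention_conv in difformer.py uses a linear-complexity kernel.
    """
    scores = torch.einsum("bsnd,bsmd->bsnm", qs, ks) / qs.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)                     # weights over the m (node) axis
    return torch.einsum("bsnm,bsmd->bsnd", attn, vs)  # [B, S, N, D]

def all_pair_attention_over_steps(qs, ks, vs):
    """Same idea, but attending along the sequence axis for each node."""
    scores = torch.einsum("bsnd,btnd->bnst", qs, ks) / qs.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)                     # weights over the t (step) axis
    out = torch.einsum("bnst,btnd->bnsd", attn, vs)
    return out.permute(0, 2, 1, 3)                    # back to [B, S, N, D]

x = torch.randn(8, 12, 207, 16)  # [batch, seq_len, nodes, features]
print(all_pair_attention_over_nodes(x, x, x).shape)  # torch.Size([8, 12, 207, 16])
print(all_pair_attention_over_steps(x, x, x).shape)  # torch.Size([8, 12, 207, 16])
```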

Hope this illustration will be helpful!

Thanks for your kind suggestion! It is enlightening and I'll give it a try.

> Thanks for your kind suggestion! It is enlightening and I'll give it a try.

I tried reducing the feature dimension to 1 (keeping only the first feature), so the shape of x is [207, 12]. The code runs, but it is very slow. What could the problem be?

Could you share more details about the dataset you are using? What are the input graph, the edge sparsity, and the labels? Also, what input do you feed to the model, e.g., a single instance of shape [207, 12] or a batch of instances of that shape? Does 207 denote the number of nodes and 12 the input feature dimension? And along which dimension is the diffusion attention computed?