Graph Transformer networks are an emerging trend in deep learning, offering promising results in tasks such as graph classification and node labeling. With this in mind, this repository is opening its doors to a more comprehensive review of Graph Transformers. My goal is to help advance this exciting field and provide a deeper understanding of its capabilities and limitations. Join me (write to krzywda@agh.edu.pl) and help explore the potential of Graph Transformers and contribute to their development. Stay tuned for more updates on this exciting journey.
I am open to collaborating on a more comprehensive survey/review of Graph Transformers.
Inspirations, approaches, datasets, and the state of the art for Transformers and their variants.
For research purposes only, to better understand the ideas behind (Graph) Transformer Networks.
- Transformers Overview
- Transformers Components
2.1 Attention Mechanism
2.2 Scaled Dot Product Attention
2.3 Multi-Head Attention
2.4 Transformer Encoder
2.5 Positional Encoding
- Graph Transformers Neural Network (GTNN)
- Graph Transformers Neural Network Components
5.1 GNNs as Auxiliary Modules in Transformer
5.2 Improved Positional Embeddings from Graphs
5.3 Improved Attention Matrices from Graphs
5.4 Graph Attention Network (GAT)
5.5 Feed-Forward MLP
- Transformers Neural Network References
- Graph Transformers Neural Network References
- Codes
Source: Attention Is All You Need
Transformer neural nets are a recent class of neural networks for sequences, based on self-attention, that have been shown to be well adapted to text and are currently driving important progress in natural language processing.
The attention mechanism describes a group of layers in neural networks that has attracted a lot of interest in the past few years, especially for sequence tasks. There are many possible definitions of "attention" in the literature, but the one we will use here is the following: the attention mechanism describes a weighted average of (sequence) elements, with the weights dynamically computed based on an input query and the elements' keys. So what does this exactly mean? The goal is to take an average over the features of multiple elements. However, instead of weighting each element equally, we want to weight them depending on their actual values. In other words, we want to dynamically decide which inputs we want to "attend" to more than others.
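As a toy illustration of this dynamic weighted average, here is a minimal NumPy sketch; the array shapes and values are made up for illustration and are not taken from any particular model:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Three sequence elements, each with a 4-dimensional key and value.
keys   = np.random.randn(3, 4)
values = np.random.randn(3, 4)
query  = np.random.randn(4)

# The weights are computed dynamically from query-key similarity ...
weights = softmax(keys @ query)         # shape (3,), sums to 1
# ... and the output is the corresponding weighted average of the values.
output = weights @ values               # shape (4,)
print(weights, output)
```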
Source: Attention Is All You Need
The core concept behind self-attention is the scaled dot product attention. Our goal is to have an attention mechanism with which any element in a sequence can attend to any other while still being efficient to compute.
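A minimal PyTorch sketch of this operation, following the formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V from Attention Is All You Need; the tensor shapes and the optional mask argument are my own conventions:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k); mask: optional (batch, seq_len, seq_len), 1 = allowed."""
    d_k = q.size(-1)
    # Raw attention logits, scaled by sqrt(d_k) so the softmax stays well-behaved.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)      # attention weights over the sequence
    return torch.matmul(attn, v), attn    # weighted average of the values + the weights
```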
Source: Attention Is All You Need
The scaled dot product attention allows a network to attend over a sequence. However, there are often multiple different aspects a sequence element wants to attend to, and a single weighted average is not a good option for that. This is why we extend the attention mechanism to multiple heads, i.e. multiple different query-key-value triplets on the same features. Specifically, given a query, key, and value matrix, we transform those into h sub-queries, sub-keys, and sub-values, which we pass through the scaled dot product attention independently.
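A minimal sketch of this head splitting in PyTorch; the joint QKV projection and layer sizes are my own choices for illustration:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention: h independent scaled dot product attention heads."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q, K, V projection
        self.out = nn.Linear(d_model, d_model)       # output projection after concatenation

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split Q, K, V into h sub-queries, sub-keys, and sub-values.
        def split(t):
            return t.view(b, n, self.h, self.d_head).transpose(1, 2)  # (b, h, n, d_head)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)                # per-head attention weights
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d_head)
        return self.out(out)                         # concatenate heads and project
```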
Originally, the Transformer model was designed for machine translation. Hence, it has an encoder-decoder structure where the encoder takes the sentence in the original language as input and generates an attention-based representation. The decoder, in turn, attends over the encoded information and generates the translated sentence in an autoregressive manner, as in a standard RNN. While this structure is extremely useful for sequence-to-sequence tasks that require autoregressive decoding, we will focus here on the encoder part.
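A minimal sketch of one encoder block built from the pieces above, using torch.nn.MultiheadAttention; the hyper-parameter defaults are placeholders and the post-norm layout follows the original paper:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: self-attention + residual/LayerNorm, then FFN + residual/LayerNorm."""
    def __init__(self, d_model=128, num_heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention over the sequence
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```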
We have discussed before that the Multi-Head Attention block is permutation-equivariant and cannot distinguish whether an input comes before another one in the sequence or not. In tasks like language understanding, however, the position is important for interpreting the input words. The position information can therefore be added via the input features. We could learn an embedding for every possible position, but this would not generalize to varying input sequence lengths. Hence, the better option is to use feature patterns that the network can identify from the features and potentially generalize to longer sequences.
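The sinusoidal encoding from Attention Is All You Need is one such pattern; a minimal sketch (assuming an even d_model) that produces the matrix added to the token embeddings:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine position features, as in Attention Is All You Need (even d_model assumed)."""
    position = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))         # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                   # odd dimensions
    return pe            # added to the token embeddings: x = embeddings + pe
```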
If we were to do multiple parallel heads of neighbourhood aggregation and replace summation over the neighbours with the attention mechanism, i.e., a weighted sum, we'd get the Graph Attention Network (GAT). Add normalization and the feed-forward MLP, and voila, we have a Graph Transformer!
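A minimal sketch of such a layer: multi-head attention restricted to graph neighbours via a dense adjacency mask, followed by LayerNorm and a feed-forward MLP. The dense adjacency matrix (with self-loops) and the hyper-parameters are assumptions for illustration, not the implementation of any specific paper:

```python
import torch
import torch.nn as nn

class GraphTransformerLayer(nn.Module):
    """Attention-weighted neighbourhood aggregation + LayerNorm + feed-forward MLP."""
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, adj):
        # x: (num_nodes, d_model); adj: (num_nodes, num_nodes) dense adjacency with self-loops.
        mask = (adj == 0)                                # True where attention is NOT allowed
        h, _ = self.attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0),
                         attn_mask=mask)                 # attend only over graph neighbours
        x = self.norm1(x + h.squeeze(0))
        x = self.norm2(x + self.ffn(x))
        return x
```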
Source: Transformer for Graphs: An Overview from Architecture Perspective (https://arxiv.org/pdf/2202.08455.pdf)
The most direct way to involve structural knowledge while benefiting from the global relation modeling of self-attention is to combine graph neural networks with the Transformer architecture. Generally, according to the relative position of the GNN layers and the Transformer layers, existing Transformer architectures with GNNs are categorized into three types, as illustrated in Figure 1 (a minimal sketch of the first type follows the list below):
- (1) building Transformer blocks on top of GNN blocks,
- (2) alternately stacking GNN blocks and Transformer blocks,
- (3) parallelizing GNN blocks and Transformer blocks.
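As referenced above, here is a minimal sketch of type (1): a simple GNN block to learn local neighbourhood representations, followed by a standard Transformer encoder for global, position-agnostic interactions. The mean-aggregation GNN layer is a deliberately simple stand-in for illustration, not the specific GNN used by GraphTrans:

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Toy message-passing layer: mean over neighbours followed by a linear map."""
    def __init__(self, d_model):
        super().__init__()
        self.lin = nn.Linear(2 * d_model, d_model)

    def forward(self, x, adj):                       # x: (N, d_model), adj: (N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg                        # mean of neighbour features
        return torch.relu(self.lin(torch.cat([x, neigh], dim=-1)))

class GNNThenTransformer(nn.Module):
    """Type (1): Transformer blocks stacked on top of GNN blocks (GraphTrans-style)."""
    def __init__(self, d_model=64, gnn_layers=2, num_heads=4):
        super().__init__()
        self.gnn = nn.ModuleList([SimpleGNNLayer(d_model) for _ in range(gnn_layers)])
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, x, adj):
        for layer in self.gnn:                       # local structure first ...
            x = layer(x, adj)
        return self.transformer(x.unsqueeze(0)).squeeze(0)   # ... then global attention
```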
The first architecture is the most frequently adopted of the three options. For example, GraphTrans adds a Transformer subnetwork on top of a standard GNN layer. The GNN layer acts as a specialized architecture that learns local representations of the structure of a node's immediate neighbourhood, while the Transformer subnetwork computes all pairwise node interactions in a position-agnostic fashion, giving the model global reasoning capability.
GraphTrans is evaluated on graph classification tasks from biology, computer programming, and chemistry, and achieves consistent improvements over the baselines. Grover consists of two GTransformer modules that represent node-level and edge-level features respectively. In each GTransformer, the inputs are first fed into a tailored GNN named dyMPN to extract vectors as queries, keys, and values from the nodes of the graph, followed by standard multi-head attention blocks. This bi-level information extraction framework enables the model to capture structural information in molecular data and makes it possible to extract global relations between nodes, enhancing the representational power of Grover.
GraphiT also falls into the first architecture: it adopts one Graph Convolutional Kernel Network (GCKN) layer to produce a structure-aware representation from the original features, and concatenates them as the input of the Transformer architecture. Here, GCKN is a multi-layer model that produces a sequence of graph feature maps similar to a GNN. Different from GNNs, each layer of GCKN enumerates local sub-structures at each node, encodes them using a kernel embedding, and aggregates the sub-structure representations as outputs. These feature maps therefore carry more structural information than those of traditional GNNs based on neighbourhood aggregation.
Mesh Graphormer follows the second architecture by stacking a Graph Residual Block (GRB) on a multi-head self-attention layer as a Transformer block to model both local and global interactions among 3D mesh vertices and body joints.
Although combining graph neural networks and Transformers has shown effectiveness in modeling graph-structured data, the best architecture for incorporating them remains an open issue and requires heavy hyper-parameter searching. It is therefore meaningful to explore a graph-encoding strategy that does not require adjusting the Transformer architecture. Similar to the positional encoding in Transformers for sequential data such as sentences, it is also possible to compress the graph structure into positional embedding (PE) vectors and add them to the input before it is fed to the actual Transformer model.
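A common instance of this idea uses the eigenvectors of the graph Laplacian as node positional embeddings (as in "A Generalization of Transformer Networks to Graphs"); the sketch below is a minimal version of that idea, with the dense adjacency input and the choice of k as assumptions:

```python
import torch

def laplacian_positional_encoding(adj, k):
    """Return the k non-trivial Laplacian eigenvectors as node positional embeddings.

    adj: (N, N) dense adjacency matrix as a float tensor.
    """
    deg = adj.sum(-1)
    d_inv_sqrt = torch.where(deg > 0, deg.pow(-0.5), torch.zeros_like(deg))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = torch.linalg.eigh(lap)          # eigenvectors, ascending eigenvalues
    pe = eigvecs[:, 1:k + 1]                     # skip the trivial constant eigenvector
    return pe                                    # (N, k) node positional embeddings
```

The resulting (N, k) matrix would typically be passed through a learned linear layer to d_model and added to the node features before the first Transformer layer, analogous to sequence positional encoding.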
Although node positional embeddings are a convenient way to inject graph priors into Transformer architectures, the process of compressing the graph structure into fixed-size vectors suffers from information loss, which may limit their effectiveness. One line of models adapts the self-attention mechanism to GNN-like architectures by restricting a node to attend only to its local neighbours in the graph, which can be computationally formulated as an attention masking mechanism. One possible extension of this practice is to mask the attention matrices of different heads with different graph priors. In the original multi-head self-attention blocks, different attention heads implicitly attend to information from different representation subspaces of different nodes. Here, using the graph-masking mechanism to force the heads to explicitly attend to different subspaces defined by graph priors further improves the model's representational capability for graph data.
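A minimal sketch of per-head graph masking, where (as an illustrative choice, not a prescription from any specific paper) each head is restricted to a different k-hop neighbourhood:

```python
import torch
import torch.nn.functional as F

def per_head_masked_attention(q, k, v, head_masks):
    """q, k, v: (num_heads, N, d_head); head_masks: (num_heads, N, N) boolean,
    True where attention is allowed (e.g. 1-hop reachability for head 0,
    2-hop reachability for head 1, ...)."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5      # (num_heads, N, N)
    scores = scores.masked_fill(~head_masks, float("-inf"))
    attn = F.softmax(scores, dim=-1)                      # each head sees its own graph prior
    return attn @ v                                       # (num_heads, N, d_head)

def k_hop_mask(adj, k):
    """Boolean mask that is True where a node can reach another within k hops (self included)."""
    reach = torch.eye(adj.size(0), dtype=torch.bool) | (adj > 0)
    out = reach.clone()
    for _ in range(k - 1):
        out = out | (out.float() @ adj.float() > 0)
    return out
```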
Source: Graph Attention Networks
Graph attention networks (GATs) are novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems.
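A minimal single-head sketch of the GAT attention coefficients, e_ij = LeakyReLU(a^T [W h_i || W h_j]), normalized with a softmax over each node's neighbours; the dense adjacency matrix (assumed to include self-loops) is used only for clarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, x, adj):                             # x: (N, in_dim), adj: (N, N) with self-loops
        h = self.W(x)                                      # (N, out_dim)
        N = h.size(0)
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every pair (i, j).
        h_i = h.unsqueeze(1).expand(N, N, -1)
        h_j = h.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1))).squeeze(-1)
        # Mask out non-neighbours, then normalize over each node's neighbourhood.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                   # attention coefficients
        return F.elu(alpha @ h)                            # attention-weighted neighbourhood sum
```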
The feed-forward layer consists of weights that are learned during training, and the exact same transformation is applied independently at each token position. Since it is applied without any communication with, or dependence on, other token positions, it is a highly parallelizable part of the model.
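A minimal sketch of this position-wise feed-forward network; the default dimensions follow the original Transformer paper, and the dropout placement is an assumption:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """The same two-layer MLP applied independently at every token position."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        # nn.Linear acts on the last dimension only, so no information is exchanged
        # between positions -- fully parallel across the sequence.
        return self.net(x)
```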
- Training Vision Transformers for Image Retrieval
- TransReID: Transformer-based Object Re-Identification
- Video Transformer Network
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
- Bottleneck Transformers for Visual Recognition
- Full Transformer Network for Image Captioning
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation
- Segmenting Transparent Object in the Wild with Transformer
- Fast Convergence of DETR with Spatially Modulated Co-Attention
- Investigating the Vision Transformer Model for Image Retrieval Tasks
- Trear: Transformer-based RGB-D Egocentric Action Recognition
- End-to-End Video Instance Segmentation with Transformers
- VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search
- TrackFormer: Multi-Object Tracking with Transformers
- Line Segment Detection Using Transformers without Edges
- Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry
- Transformer for Image Quality Assessment
- TransTrack: Multiple-Object Tracking with Transformer
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
- TransPose: Towards Explainable Human Pose Estimation by Transformer
- Training data-efficient image transformers & distillation through attention
- 3D Object Detection with Pointformer
- Toward Transformer-Based Object Detection
- Taming Transformers for High-Resolution Image Synthesis
- SceneFormer: Indoor Scene Generation with Transformers
- PCT: Point Cloud Transformer
- Transformer Interpretability Beyond Attention Visualization
- End-to-End Human Pose and Mesh Reconstruction with Transformers
- Point Transformer
- Pedestrian Detection
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers
- Modeling Long-Range Interactions Without Attention
- General Multi-label Image Classification with Transformers
- Rethinking Transformer-based Set Prediction for Object Detection
- Pre-Trained Image Processing Transformer
- End-to-End Object Detection with Adaptive Clustering Transformer
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Deformable DETR: Deformable Transformers for End-to-End Object Detection
- End-to-end Lane Shape Prediction with Transformers
- End-to-End Object Detection with Transformers
- Feature Pyramid Transformer
- Learning Texture Transformer Network for Image Super-Resolution
- A Generalization of Transformer Networks to Graphs
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching
- Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-Behavior Recommendation
- Geometric Transformers for Protein Interface Contact Prediction
- Retrieving Complex Tables with Multi-Granular Graph Representation Learning
- HetEmotionNet: Two-Stream Heterogeneous Graph Recurrent Neural Network for Multi-modal Emotion Recognition
- A community-powered search of machine learning strategy space to find NMR property prediction models
- Edge-augmented Graph Transformers: Global Self-attention is Enough for Graphs
- Transformer for Graphs: An Overview from Architecture Perspective
- Power Law Graph Transformer for Machine Translation and Representation Learning
- HeteroQA: Learning towards Question-and-Answering through Multiple Information Sources via Heterogeneous Graph Modeling
- Unsupervised Pre-Training on Patient Population Graphs for Patient-Level Predictions
- Anomaly Detection in Dynamic Graphs via Transformer
- Activity Graph Transformer for Temporal Action Localization
- Gophormer: Ego-Graph Transformer for Node Classification
- BERT-GT: Cross-sentence n-ary relation extraction with BERT and Graph Transformer
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking
- Improved Drug-target Interaction Prediction with Intermolecular Graph Transformer
- SEA: Graph Shell Attention in Graph Neural Networks
- Dynamic Graph Representation Learning via Graph Transformer Networks
- Multivariate Realized Volatility Forecasting with Graph Neural Network
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
- Zero-Shot Sketch Based Image Retrieval using Graph Transformer
- AlphaDesign: A graph protein design method and benchmark on AlphaFoldDB
- Graph Masked Autoencoder
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation
- On Representation Learning for Scientific News Articles Using Heterogeneous Knowledge Graphs
- Extracting Temporal Event Relation with Syntactic-Guided Temporal Graph Transformer
- GTN-ED: Event Detection Using Graph Transformer Networks
- GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer
- Dynamic Graph Transformer for Implicit Tag Recognition
- Graph transformer network with temporal kernel attention for skeleton-based action recognition
- Relation-aware Heterogeneous Graph Transformer based drug repurposing
- Blockchain-enabled fraud discovery through abnormal smart contract detection on Ethereum
- Transformer-Based Graph Convolutional Network for Sentiment Analysis
- RGTransformer: Region-Graph Transformer for Image Representation and Few-shot Classification
- Multi-Omic Graph Transformers for Cancer Classification and Interpretation
- Relation-Aware Graph Transformer for SQL-to-Text Generation
- Assembled graph neural network using graph transformer with edges for protein model quality assessment
- Graph transformer for communities detection in social networks
- Contrastive learning of graph encoder for accelerating pedestrian trajectory prediction training
- Graph transformer networks based text representation
- Latent Memory-augmented Graph Transformer for Visual Storytelling
- Pre-training Graph Transformer with Multimodal Side Information for Recommendation
- Element graph-augmented abstractive summarization for legal public opinion news with graph transformer
- The framework design of question generation based on knowledge graph
- Document-level relation extraction via graph transformer networks and temporal convolutional networks
- Representation Learning on Knowledge Graphs for Node Importance Estimation
- Heterogeneous Temporal Graph Transformer: An Intelligent System for Evolving Android Malware Detection
- FACE-KEG: Fact Checking Explained using KnowledgE Graphs
- STGT: Forecasting pedestrian motion using spatio-temporal graph transformer
- GTAE: Graph transformer based auto-encoders for linguistic-constrained text style transfer
- Recursive non-autoregressive graph-to-graph transformer for dependency parsing with iterative refinement
- Directional Graph Transformer-Based Control Flow Embedding for Malware Classification
- Graph Transformer Attention Networks for Traffic Flow Prediction
- Stacked Graph Transformer for HIV Molecular Prediction
- User Identification in Online Social Networks using Graph Transformer Networks
- Research on Intelligent Diagnosis Model of Electronic Medical Record Based on Graph Transformer
- Propagation-Based Fake News Detection Using Graph Neural Networks with Transformer
- Graph transformer-convolution network for graph classification
- CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation
- Meta Graph Transformer: A Novel Framework for Spatial–Temporal Traffic Prediction
- Learning contextual representations of citations via graph transformer
- Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification
- Knowledge based natural answer generation via masked-graph transformer
- Graph Transformers for Characterization and Interpretation of Surgical Margins
- Structure-Function Mapping via Graph Neural Networks
- Application of Multiattention Mechanism in Power System Branch Parameter Identification
- Fraud Detection in Online Product Review Systems via Heterogeneous Graph Transformer
- Graph Transformer: Learning Better Representations for Graph Neural Networks
- Joint IAPR International Workshops on Structural, Syntactic and Statistical Techniques in Pattern Recognition, S+SSPR 2020
- Privacy-Preserving Visual Content Tagging using Graph Transformer Networks
- Forecaster: A graph transformer for forecasting spatial and time-dependent data
- Learning Bi-directional Social Influence in Information Cascades using Graph Sequence Attention Networks
- Heterogeneous Graph Transformer
- Question Generation from Knowledge Base with Graph Transformer
- Graph transformer networks with syntactic and semantic structures for event argument extraction
- Self-supervised graph transformer on large-scale molecular data
- Graph transformer for graph-to-sequence learning
- Graph-to-graph transformer for transition-based dependency parsing
- Text graph transformer for document classification
- Online back-parsing for AMR-to-text generation
- Natural Answer Generation via Graph Transformer
- Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction
- Amr-to-text generation with graph transformer
- Hierarchical Graph Transformer-Based Deep Learning Model for Large-Scale Multi-Label Text Classification
- Graph transformer networks
- Text generation from knowledge graphs with graph transformers
- Knowledge-driven encode, retrieve, paraphrase for medical image report generation
- Learning graphical state transitions
- Deep learning for efficient discriminative parsing
- Energy-based models in document recognition and computer vision
- Stochastic learning
- Machine learning for sequential data: A review
- Object recognition with gradient-based learning
- Gradient-based learning applied to document recognition
- Reading checks with multilayer graph transformer networks
- Global training of document processing systems using graph transformer networks
- Text Generation from Knowledge Graphs with Graph Transformers
- A Generalization of Transformer Networks to Graphs
- Implementation of Graph Transformer Networks (GTN)
- Universal Graph Transformer Self-Attention Networks
- Graph Augmented Transformers for Medication Recommendation
- Heterogeneous Graph Transformer (HGT)
- Graph Transformer for Graph-to-Sequence Learning
- Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction
- Gated Graph Transformers
- TorchMD-NET provides state-of-the-art graph neural network and equivariant transformer neural network potentials for learning molecular potentials.
- Heterogeneous Graph Transformer for Graph-to-Sequence Learning
- Graph Neural Networks for Multi-Label Classification