Graph Transformer networks are an emerging trend in deep learning, offering promising results in tasks such as graph classification and node labeling. With this in mind, this repository is opening its doors to a more comprehensive review of Graph Transformers. My goal is to help advance this exciting field and provide a deeper understanding of its capabilities and limitations. Join me (write to krzywda@agh.edu.pl) and help explore the potential of Graph Transformers and contribute to their development. Stay tuned for more updates on this exciting journey.
I am open to collaborating on a more comprehensive survey/review of Graph Transformers.
Inspirations, approaches, datasets, and the state of the art for Transformers and their variants.
For research purposes only, to better understand the ideas behind (Graph) Transformer Networks.
- Transformers Overview
- Transformers Components
2.1 Attention Mechanism
2.2 Scaled Dot Product Attention
2.3 Multi-Head Attention
2.4 Transformer Encoder
2.5 Positional Encoding
- Graph Transformers Neural Network (GTNN)
- Graph Transformers Neural Network Components
5.1 GNNs as Auxiliary Modules in Transformer
5.2 Improved Positional Embeddings from Graphs
5.3 Improved Attention Matrices from Graphs
5.4 Graph Attention Network (GAT)
5.5 Feed-Forward MLP
- Transformers Neural Network References
- Graph Transformers Neural Network References
- Codes
Source: Attention Is All You Need
Transformer neural nets are a recent class of neural networks for sequences, based on self-attention, that have been shown to be well adapted to text and are currently driving important progress in natural language processing.
The attention mechanism describes a group of layers in neural networks that has attracted a lot of interest in the past few years, especially for sequence tasks. There are many possible definitions of "attention" in the literature, but the one we will use here is the following: the attention mechanism describes a weighted average of (sequence) elements, with the weights dynamically computed based on an input query and the elements' keys. So what does this exactly mean? The goal is to take an average over the features of multiple elements. However, instead of weighting each element equally, we want to weight them depending on their actual values. In other words, we want to dynamically decide which inputs we want to "attend" to more than others.
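As a toy illustration of this dynamic weighted average, here is a minimal NumPy sketch; the array shapes and values are made up for illustration and are not taken from any particular model:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Three sequence elements, each with a 4-dimensional key and value.
keys   = np.random.randn(3, 4)
values = np.random.randn(3, 4)
query  = np.random.randn(4)

# The weights are computed dynamically from query-key similarity ...
weights = softmax(keys @ query)         # shape (3,), sums to 1
# ... and the output is the corresponding weighted average of the values.
output = weights @ values               # shape (4,)
print(weights, output)
```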
Source: Attention Is All You Need
The core concept behind self-attention is the scaled dot product attention. Our goal is to have an attention mechanism with which any element in a sequence can attend to any other while still being efficient to compute.
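A minimal PyTorch sketch of this operation, following the formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V from Attention Is All You Need; the tensor shapes and the optional mask argument are my own conventions:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k); mask: optional (batch, seq_len, seq_len), 1 = allowed."""
    d_k = q.size(-1)
    # Raw attention logits, scaled by sqrt(d_k) so the softmax stays well-behaved.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)      # attention weights over the sequence
    return torch.matmul(attn, v), attn    # weighted average of the values + the weights
```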
Source: Attention Is All You Need
The scaled dot product attention allows a network to attend over a sequence. However, there are often multiple different aspects a sequence element wants to attend to, and a single weighted average is not a good option for that. This is why we extend the attention mechanism to multiple heads, i.e. multiple different query-key-value triplets on the same features. Specifically, given a query, key, and value matrix, we transform those into h sub-queries, sub-keys, and sub-values, which we pass through the scaled dot product attention independently.
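A minimal sketch of this head splitting in PyTorch; the joint QKV projection and layer sizes are my own choices for illustration:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention: h independent scaled dot product attention heads."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q, K, V projection
        self.out = nn.Linear(d_model, d_model)       # output projection after concatenation

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split Q, K, V into h sub-queries, sub-keys, and sub-values.
        def split(t):
            return t.view(b, n, self.h, self.d_head).transpose(1, 2)  # (b, h, n, d_head)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)                # per-head attention weights
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d_head)
        return self.out(out)                         # concatenate heads and project
```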
Originally, the Transformer model was designed for machine translation. Hence, it has an encoder-decoder structure where the encoder takes the sentence in the original language as input and generates an attention-based representation. The decoder, in turn, attends over the encoded information and generates the translated sentence in an autoregressive manner, as in a standard RNN. While this structure is extremely useful for sequence-to-sequence tasks that require autoregressive decoding, we will focus here on the encoder part.
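A minimal sketch of one encoder block built from the pieces above, using torch.nn.MultiheadAttention; the hyper-parameter defaults are placeholders and the post-norm layout follows the original paper:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: self-attention + residual/LayerNorm, then FFN + residual/LayerNorm."""
    def __init__(self, d_model=128, num_heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention over the sequence
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```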
We have discussed before that the Multi-Head Attention block is permutation-equivariant and cannot distinguish whether an input comes before another one in the sequence or not. In tasks like language understanding, however, the position is important for interpreting the input words. The position information can therefore be added via the input features. We could learn an embedding for every possible position, but this would not generalize to varying input sequence lengths. Hence, the better option is to use feature patterns that the network can identify from the features and potentially generalize to longer sequences.
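The sinusoidal encoding from Attention Is All You Need is one such pattern; a minimal sketch (assuming an even d_model) that produces the matrix added to the token embeddings:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine position features, as in Attention Is All You Need (even d_model assumed)."""
    position = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))         # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                   # odd dimensions
    return pe            # added to the token embeddings: x = embeddings + pe
```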
If we were to do multiple parallel heads of neighbourhood aggregation and replace summation over the neighbours with the attention mechanism, i.e., a weighted sum, we'd get the Graph Attention Network (GAT). Add normalization and the feed-forward MLP, and voila, we have a Graph Transformer!
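A minimal sketch of such a layer: multi-head attention restricted to graph neighbours via a dense adjacency mask, followed by LayerNorm and a feed-forward MLP. The dense adjacency matrix (with self-loops) and the hyper-parameters are assumptions for illustration, not the implementation of any specific paper:

```python
import torch
import torch.nn as nn

class GraphTransformerLayer(nn.Module):
    """Attention-weighted neighbourhood aggregation + LayerNorm + feed-forward MLP."""
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, adj):
        # x: (num_nodes, d_model); adj: (num_nodes, num_nodes) dense adjacency with self-loops.
        mask = (adj == 0)                                # True where attention is NOT allowed
        h, _ = self.attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0),
                         attn_mask=mask)                 # attend only over graph neighbours
        x = self.norm1(x + h.squeeze(0))
        x = self.norm2(x + self.ffn(x))
        return x
```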
Source: Transformer for Graphs: An Overview from Architecture Perspective (https://arxiv.org/pdf/2202.08455.pdf)
The most direct way to involve structural knowledge while benefiting from the global relation modeling of self-attention is to combine graph neural networks with the Transformer architecture. Generally, according to the relative position of the GNN layers and the Transformer layers, existing Transformer architectures with GNNs are categorized into three types, as illustrated in Figure 1 (a minimal sketch of the first type follows the list below):
- (1) building Transformer blocks on top of GNN blocks,
- (2) alternately stacking GNN blocks and Transformer blocks,
- (3) parallelizing GNN blocks and Transformer blocks.
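As referenced above, here is a minimal sketch of type (1): a simple GNN block to learn local neighbourhood representations, followed by a standard Transformer encoder for global, position-agnostic interactions. The mean-aggregation GNN layer is a deliberately simple stand-in for illustration, not the specific GNN used by GraphTrans:

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Toy message-passing layer: mean over neighbours followed by a linear map."""
    def __init__(self, d_model):
        super().__init__()
        self.lin = nn.Linear(2 * d_model, d_model)

    def forward(self, x, adj):                       # x: (N, d_model), adj: (N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg                        # mean of neighbour features
        return torch.relu(self.lin(torch.cat([x, neigh], dim=-1)))

class GNNThenTransformer(nn.Module):
    """Type (1): Transformer blocks stacked on top of GNN blocks (GraphTrans-style)."""
    def __init__(self, d_model=64, gnn_layers=2, num_heads=4):
        super().__init__()
        self.gnn = nn.ModuleList([SimpleGNNLayer(d_model) for _ in range(gnn_layers)])
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, x, adj):
        for layer in self.gnn:                       # local structure first ...
            x = layer(x, adj)
        return self.transformer(x.unsqueeze(0)).squeeze(0)   # ... then global attention
```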
The first architecture is the most frequently adopted of the three options. For example, GraphTrans adds a Transformer subnetwork on top of a standard GNN layer. The GNN layer acts as a specialized architecture that learns local representations of the structure of a node's immediate neighbourhood, while the Transformer subnetwork computes all pairwise node interactions in a position-agnostic fashion, giving the model global reasoning capability.
GraphTrans is evaluated on graph classification tasks from biology, computer programming, and chemistry, and achieves consistent improvements over the baselines. Grover consists of two GTransformer modules that represent node-level and edge-level features respectively. In each GTransformer, the inputs are first fed into a tailored GNN named dyMPN to extract vectors as queries, keys, and values from the nodes of the graph, followed by standard multi-head attention blocks. This bi-level information extraction framework enables the model to capture structural information in molecular data and makes it possible to extract global relations between nodes, enhancing the representational power of Grover.
GraphiT also falls into the first architecture: it adopts one Graph Convolutional Kernel Network (GCKN) layer to produce a structure-aware representation from the original features, and concatenates them as the input of the Transformer architecture. Here, GCKN is a multi-layer model that produces a sequence of graph feature maps similar to a GNN. Different from GNNs, each layer of GCKN enumerates local sub-structures at each node, encodes them using a kernel embedding, and aggregates the sub-structure representations as outputs. These feature maps therefore carry more structural information than those of traditional GNNs based on neighbourhood aggregation.
Mesh Graphormer follows the second architecture by stacking a Graph Residual Block (GRB) on a multi-head self-attention layer as a Transformer block to model both local and global interactions among 3D mesh vertices and body joints.
Although combining graph neural networks and Transformers has shown effectiveness in modeling graph-structured data, the best architecture for incorporating them remains an open issue and requires heavy hyper-parameter searching. It is therefore meaningful to explore a graph-encoding strategy that does not require adjusting the Transformer architecture. Similar to the positional encoding in Transformers for sequential data such as sentences, it is also possible to compress the graph structure into positional embedding (PE) vectors and add them to the input before it is fed to the actual Transformer model.
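A common instance of this idea uses the eigenvectors of the graph Laplacian as node positional embeddings (as in "A Generalization of Transformer Networks to Graphs"); the sketch below is a minimal version of that idea, with the dense adjacency input and the choice of k as assumptions:

```python
import torch

def laplacian_positional_encoding(adj, k):
    """Return the k non-trivial Laplacian eigenvectors as node positional embeddings.

    adj: (N, N) dense adjacency matrix as a float tensor.
    """
    deg = adj.sum(-1)
    d_inv_sqrt = torch.where(deg > 0, deg.pow(-0.5), torch.zeros_like(deg))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = torch.linalg.eigh(lap)          # eigenvectors, ascending eigenvalues
    pe = eigvecs[:, 1:k + 1]                     # skip the trivial constant eigenvector
    return pe                                    # (N, k) node positional embeddings
```

The resulting (N, k) matrix would typically be passed through a learned linear layer to d_model and added to the node features before the first Transformer layer, analogous to sequence positional encoding.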
Although node positional embeddings are a convenient way to inject graph priors into Transformer architectures, the process of compressing the graph structure into fixed-size vectors suffers from information loss, which may limit their effectiveness. One line of models adapts the self-attention mechanism to GNN-like architectures by restricting a node to attend only to its local neighbours in the graph, which can be computationally formulated as an attention masking mechanism. One possible extension of this practice is to mask the attention matrices of different heads with different graph priors. In the original multi-head self-attention blocks, different attention heads implicitly attend to information from different representation subspaces of different nodes. Here, using the graph-masking mechanism to force the heads to explicitly attend to different subspaces defined by graph priors further improves the model's representational capability for graph data.
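A minimal sketch of per-head graph masking, where (as an illustrative choice, not a prescription from any specific paper) each head is restricted to a different k-hop neighbourhood:

```python
import torch
import torch.nn.functional as F

def per_head_masked_attention(q, k, v, head_masks):
    """q, k, v: (num_heads, N, d_head); head_masks: (num_heads, N, N) boolean,
    True where attention is allowed (e.g. 1-hop reachability for head 0,
    2-hop reachability for head 1, ...)."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5      # (num_heads, N, N)
    scores = scores.masked_fill(~head_masks, float("-inf"))
    attn = F.softmax(scores, dim=-1)                      # each head sees its own graph prior
    return attn @ v                                       # (num_heads, N, d_head)

def k_hop_mask(adj, k):
    """Boolean mask that is True where a node can reach another within k hops (self included)."""
    reach = torch.eye(adj.size(0), dtype=torch.bool) | (adj > 0)
    out = reach.clone()
    for _ in range(k - 1):
        out = out | (out.float() @ adj.float() > 0)
    return out
```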
Source: Graph Attention Networks
Graph attention networks (GATs) are novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems.
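A minimal single-head sketch of the GAT attention coefficients, e_ij = LeakyReLU(a^T [W h_i || W h_j]), normalized with a softmax over each node's neighbours; the dense adjacency matrix (assumed to include self-loops) is used only for clarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, x, adj):                             # x: (N, in_dim), adj: (N, N) with self-loops
        h = self.W(x)                                      # (N, out_dim)
        N = h.size(0)
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every pair (i, j).
        h_i = h.unsqueeze(1).expand(N, N, -1)
        h_j = h.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1))).squeeze(-1)
        # Mask out non-neighbours, then normalize over each node's neighbourhood.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                   # attention coefficients
        return F.elu(alpha @ h)                            # attention-weighted neighbourhood sum
```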
The feed-forward layer consists of weights that are learned during training, and the exact same transformation is applied independently at each token position. Since it is applied without any communication with, or dependence on, other token positions, it is a highly parallelizable part of the model.
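A minimal sketch of this position-wise feed-forward network; the default dimensions follow the original Transformer paper, and the dropout placement is an assumption:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """The same two-layer MLP applied independently at every token position."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        # nn.Linear acts on the last dimension only, so no information is exchanged
        # between positions -- fully parallel across the sequence.
        return self.net(x)
```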
- Training Vision Transformers for Image Retrieval
- TransReID: Transformer-based Object Re-Identification
- Video Transformer Network
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
- Bottleneck Transformers for Visual Recognition
- Full Transformer Network for Image Captioning
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation
- Segmenting Transparent Object in the Wild with Transformer
- Fast Convergence of DETR with Spatially Modulated Co-Attention
- Investigating the Vision Transformer Model for Image Retrieval Tasks
- Trear: Transformer-based RGB-D Egocentric Action Recognition
- End-to-End Video Instance Segmentation with Transformers
- VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search
- TrackFormer: Multi-Object Tracking with Transformers
- Line Segment Detection Using Transformers without Edges
- Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry
- Transformer for Image Quality Assessment
- TransTrack: Multiple-Object Tracking with Transformer
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
- TransPose: Towards Explainable Human Pose Estimation by Transformer
- Training data-efficient image transformers & distillation through attention
- 3D Object Detection with Pointformer
- Toward Transformer-Based Object Detection
- Taming Transformers for High-Resolution Image Synthesis
- SceneFormer: Indoor Scene Generation with Transformers
- PCT: Point Cloud Transformer
- Transformer Interpretability Beyond Attention Visualization
- End-to-End Human Pose and Mesh Reconstruction with Transformers
- Point Transformer
- Pedestrian Detection
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers
- Modeling Long-Range Interactions Without Attention
- General Multi-label Image Classification with Transformers
- Rethinking Transformer-based Set Prediction for Object Detection
- Pre-Trained Image Processing Transformer
- End-to-End Object Detection with Adaptive Clustering Transformer
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Deformable DETR: Deformable Transformers for End-to-End Object Detection
- End-to-end Lane Shape Prediction with Transformers
- End-to-End Object Detection with Transformers
- Feature Pyramid Transformer
- Learning Texture Transformer Network for Image Super-Resolution
- A Generalization of Transformer Networks to Graphs
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching
- Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-Behavior Recommendation
- Geometric Transformers for Protein Interface Contact Prediction
- Retrieving Complex Tables with Multi-Granular Graph Representation Learning
- HetEmotionNet: Two-Stream Heterogeneous Graph Recurrent Neural Network for Multi-modal Emotion Recognition
- A community-powered search of machine learning strategy space to find NMR property prediction models
- Edge-augmented Graph Transformers: Global Self-attention is Enough for Graphs
- Transformer for Graphs: An Overview from Architecture Perspective
- Power Law Graph Transformer for Machine Translation and Representation Learning
- HeteroQA: Learning towards Question-and-Answering through Multiple Information Sources via Heterogeneous Graph Modeling
- Unsupervised Pre-Training on Patient Population Graphs for Patient-Level Predictions
- Anomaly Detection in Dynamic Graphs via Transformer
- Activity Graph Transformer for Temporal Action Localization
- Gophormer: Ego-Graph Transformer for Node Classification
- BERT-GT: Cross-sentence n-ary relation extraction with BERT and Graph Transformer
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking
- Improved Drug-target Interaction Prediction with Intermolecular Graph Transformer
- SEA: Graph Shell Attention in Graph Neural Networks
- Dynamic Graph Representation Learning via Graph Transformer Networks
- Multivariate Realized Volatility Forecasting with Graph Neural Network
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
- Zero-Shot Sketch Based Image Retrieval using Graph Transformer
- AlphaDesign: A graph protein design method and benchmark on AlphaFoldDB
- Graph Masked Autoencoder
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation
- On Representation Learning for Scientific News Articles Using Heterogeneous Knowledge Graphs
- Extracting Temporal Event Relation with Syntactic-Guided Temporal Graph Transformer
- GTN-ED: Event Detection Using Graph Transformer Networks
- GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer
- Dynamic Graph Transformer for Implicit Tag Recognition
- Graph transformer network with temporal kernel attention for skeleton-based action recognition
- Relation-aware Heterogeneous Graph Transformer based drug repurposing
- Blockchain-enabled fraud discovery through abnormal smart contract detection on Ethereum
- Transformer-Based Graph Convolutional Network for Sentiment Analysis
- RGTransformer: Region-Graph Transformer for Image Representation and Few-shot Classification
- Multi-Omic Graph Transformers for Cancer Classification and Interpretation
- Relation-Aware Graph Transformer for SQL-to-Text Generation
- Assembled graph neural network using graph transformer with edges for protein model quality assessment
- Graph transformer for communities detection in social networks
- Contrastive learning of graph encoder for accelerating pedestrian trajectory prediction training
- Graph transformer networks based text representation
- Latent Memory-augmented Graph Transformer for Visual Storytelling
- Pre-training Graph Transformer with Multimodal Side Information for Recommendation
- Element graph-augmented abstractive summarization for legal public opinion news with graph transformer
- The framework design of question generation based on knowledge graph
- Document-level relation extraction via graph transformer networks and temporal convolutional networks
- Representation Learning on Knowledge Graphs for Node Importance Estimation
- Heterogeneous Temporal Graph Transformer: An Intelligent System for Evolving Android Malware Detection
- FACE-KEG: Fact Checking Explained using KnowledgE Graphs
- STGT: Forecasting pedestrian motion using spatio-temporal graph transformer
- GTAE: Graph transformer based auto-encoders for linguistic-constrained text style transfer
- Recursive non-autoregressive graph-to-graph transformer for dependency parsing with iterative refinement
- Directional Graph Transformer-Based Control Flow Embedding for Malware Classification
- Graph Transformer Attention Networks for Traffic Flow Prediction
- Stacked Graph Transformer for HIV Molecular Prediction
- User Identification in Online Social Networks using Graph Transformer Networks
- Research on Intelligent Diagnosis Model of Electronic Medical Record Based on Graph Transformer
- Propagation-Based Fake News Detection Using Graph Neural Networks with Transformer
- Graph transformer-convolution network for graph classification
- CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation
- Meta Graph Transformer: A Novel Framework for Spatial–Temporal Traffic Prediction
- Learning contextual representations of citations via graph transformer
- Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification
- Knowledge based natural answer generation via masked-graph transformer
- Graph Transformers for Characterization and Interpretation of Surgical Margins
- Structure-Function Mapping via Graph Neural Networks
- Application of Multiattention Mechanism in Power System Branch Parameter Identification
- Fraud Detection in Online Product Review Systems via Heterogeneous Graph Transformer
- Graph Transformer: Learning Better Representations for Graph Neural Networks
- Joint IAPR International Workshops on Structural, Syntactic and Statistical Techniques in Pattern Recognition, S+SSPR 2020
- Privacy-Preserving Visual Content Tagging using Graph Transformer Networks
- Forecaster: A graph transformer for forecasting spatial and time-dependent data
- Learning Bi-directional Social Influence in Information Cascades using Graph Sequence Attention Networks
- Heterogeneous Graph Transformer
- Question Generation from Knowledge Base with Graph Transformer
- Graph transformer networks with syntactic and semantic structures for event argument extraction
- Self-supervised graph transformer on large-scale molecular data
- Graph transformer for graph-to-sequence learning
- Graph-to-graph transformer for transition-based dependency parsing
- Text graph transformer for document classification
- Online back-parsing for AMR-to-text generation
- Natural Answer Generation via Graph Transformer
- Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction
- Amr-to-text generation with graph transformer
- Hierarchical Graph Transformer-Based Deep Learning Model for Large-Scale Multi-Label Text Classification
- Graph transformer networks
- Text generation from knowledge graphs with graph transformers
- Knowledge-driven encode, retrieve, paraphrase for medical image report generation
- Learning graphical state transitions
- Deep learning for efficient discriminative parsing
- Energy-based models in document recognition and computer vision
- Stochastic learning
- Machine learning for sequential data: A review
- Object recognition with gradient-based learning
- Gradient-based learning applied to document recognition
- Reading checks with multilayer graph transformer networks
- Global training of document processing systems using graph transformer networks
- Text Generation from Knowledge Graphs with Graph Transformers
- A Generalization of Transformer Networks to Graphs
- Implementation of Graph Transformer Networks (GTN)
- Universal Graph Transformer Self-Attention Networks
- Graph Augmented Transformers for Medication Recommendation
- Heterogeneous Graph Transformer (HGT)
- Graph Transformer for Graph-to-Sequence Learning
- Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction
- Gated Graph Transformers
- TorchMD-NET provides state-of-the-art graph neural network and equivariant transformer neural network potentials for learning molecular potentials.
- Heterogeneous Graph Transformer for Graph-to-Sequence Learning
- Graph Neural Networks for Multi-Label Classification