awesome-fast-attention

A curated list of efficient attention modules (last update: Sun, 28 Feb 2021 11:40:43 +0000)

Efficient Attention
Articles/Surveys/Benchmarks

Efficient Attention

Paper (citations)	Implementation	Computational Complexity	AutoRegressive	Main Idea
Generating Wikipedia by Summarizing Long Sequences (278)	memory-compressed-attention	$\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D})$	✔️	compresses key and value + blocked attention
CBAM: Convolutional Block Attention Module (999+)	attention-module	$\mathcal{O}(({N}\cdot{D}+\frac{{D}^2}{r})+({N}\cdot{D}\cdot{k}^2))$	❌	combines SE attention with a per-pixel (local) weight
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16)	set_transformer	$\mathcal{O}({N}\cdot{K}\cdot{D})$	❌	uses K relay nodes
CCNet: Criss-Cross Attention for Semantic Segmentation (290)	CCNet	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	❌	each pixel attends to its row and column simultaneously
Efficient Attention: Attention with Linear Complexities (16)	efficient-attention	$\mathcal{O}({N}\cdot{D}^2)$	❌	Softmax(Q)(Softmax(K^T)V)
Star-Transformer (40)	fastNLP	$\mathcal{O}({N}\cdot{D})$	❌	uses a relay (global) node and attends to/from that node
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (196)	GCNet	$\mathcal{O}({N}\cdot{D}^2)$	❌	squeeze and excitation with an attention pooling (instead of a GAP)
Generating Long Sequences with Sparse Transformers (249)	DeepSpeed	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	sparse block based attention
SCRAM: Spatially Coherent Randomized Attention Maps (1)	-	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	uses PatchMatch to find close keys
Interlaced Sparse Self-Attention for Semantic Segmentation (23)	IN_PAPER	$\mathcal{O}({N}\cdot{D}^2+{N}\cdot\sqrt{N}\cdot{D})$	✔️	combines short-range attention followed by long-range (dilated) attention
Permutohedral Attention Module for Efficient Non-Local Neural Networks (3)	Permutohedral_attention_module	$\mathcal{O}({N}\cdot{D}^2)$	❌	uses the permutohedral lattice approximation algorithm to approximate the attention output
Large Memory Layers with Product Keys (42)	XLM	$\mathcal{O}({Q}\cdot({K}+{k}^2)\cdot{D})$	✔️	searches for nearest-neighbor keys
Expectation-Maximization Attention Networks for Semantic Segmentation (78)	EMANet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	applies expectation maximization to cluster keys into k clusters
BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15)	BPT	$\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D})$	✔️	attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner
Compressive Transformers for Long-Range Sequence Modelling (47)	compressive-transformer-pytorch	$\mathcal{O}({N}^2\cdot{D})$	✔️	compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL
Axial Attention in Multidimensional Transformers (30)	axial-attention	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	✔️	applies attention on each axis separately
Reformer: The Efficient Transformer (208)	trax	$\mathcal{O}({N}\cdot\log({N})\cdot{D}^2)$	✔️	uses LSH to find close keys
Sparse Sinkhorn Attention (15)	sinkhorn-transformer	$\mathcal{O}(\frac{{N}^2}{n_b}+{n_b}^2)$	✔️	uses a cost matrix to limit attention between buckets
Transformer on a Diet (2)	transformer-on-diet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	dilated transformer, like WaveNet
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2)	-	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	learns the q, k connections, i.e. dynamically creates a sparse attention matrix
Efficient Content-Based Sparse Attention with Routing Transformers (36)	routing-transformer	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	computes attention with same-cluster tokens (computed by online k-means)
Neural Architecture Search for Lightweight Non-Local Networks (10)	AutoNL	$\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2)$	❌	computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions
ETC: Encoding Long and Structured Inputs in Transformers (14)	-	$\mathcal{O}(({N}\cdot{g}+{g}^2+{N}\cdot{k})\cdot{D})$	❌	combines global attention (star transformer with multiple global tokens) with local attention
Longformer: The Long-Document Transformer (151)	longformer	$\mathcal{O}({N}\cdot({k}+{g})\cdot{D})$	✔️	global + blocked attention
Multi-scale Transformer Language Models (2)	IN_PAPER	$\mathcal{O}({N}^2\cdot{D})$	✔️	UNet-like + retina attention, which is close to BP-Transformer
Synthesizer: Rethinking Self-Attention in Transformer Models (24)	Synthesizer-Rethinking-Self-Attention-Transformer-Models	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions
Jukebox: A Generative Model for Music (42)	jukebox	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	better attention patterns from Sparse Transformer
Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0)	-	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions and uses fixed mask patterns
GMAT: Global Memory Augmentation for Transformers (2)	gmat	$\mathcal{O}({m}\cdot({N}+{m})\cdot{D})$	❌	adds global tokens
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (41)	fast-transformers	$\mathcal{O}({N}\cdot{D}^2)$	✔️	uses phi(q)(phi(k)v) and also improves the sequential sampling step
Linformer: Self-Attention with Linear Complexity (43)	linformer-pytorch	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	projects key and value from nd to kd
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (7)	google-research	$\mathcal{O}({N}\cdot{D}^2\cdot\log({D}))$	✔️	calculates an unbiased stochastic approximation of the attention matrix
Kronecker Attention Networks (1)	kronecker-attention-pytorch	$\mathcal{O}(({H}+{W})^2\cdot{D})$	❌	uses horizontal and lateral average matrices
Real-time Semantic Segmentation with Fast Attention (5)	-	$\mathcal{O}({N}\cdot{D}^2)$	❌	l2_norm(q)(l2_norm(k)v)
Big Bird: Transformers for Longer Sequences (57)	DeepSpeed	$\mathcal{O}(({g}^2+{N}\cdot({k}+{g}+{r}))\cdot{D})$	❌	ETC with random connections
Fast Transformers with Clustered Attention (6)	fast-transformers	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	groups queries together with LSH
Tensor Low-Rank Reconstruction for Semantic Segmentation (3)	-	$\mathcal{O}(({D}\cdot{H}\cdot{W}+{D}^2+{H}^2+{W}^2)\cdot{r})$	❌	decomposes the full attention tensor into rank-one tensors (CP decomposition)
Looking for change? Roll the Dice and demand Attention (0)	IN_PAPER	$\mathcal{O}({H}\cdot{W}\cdot{D})$	❌	uses the fractal Tanimoto similarity to compare queries with keys inside the attention module
Rethinking Attention with Performers (25)	google-research	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	unbiased approximation of the attention matrix with softmax kernel
Memformer: The Memory-Augmented Transformer (0)	memformer	$\mathcal{O}({N}\cdot{D})$	✔️	attends to memory slots + Memory-Replay BackPropagation
SMYRF: Efficient Attention using Asymmetric Clustering (1)	smyrf	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	❌	LSH with balanced clusters
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0)	Informer2020	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	sparse attention + funnel-like encoder
Sub-Linear Memory: How to Make Performers SLiM (0)	google-research	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	Performer but with sublinear memory usage
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0)	Nystromformer	$\mathcal{O}({N}\cdot{D})$	❌	uses the Nyström method to approximate the attention matrix
Linear Transformers Are Secretly Fast Weight Memory Systems (0)	fast-weight-transformers	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	shows that linear transformers are basically fast weight networks + proposes a new kernel function to linearise attention, balancing simplicity and effectiveness
LambdaNetworks: Modeling Long-Range Interactions Without Attention (5)	lambda-networks	$\mathcal{O}({N}^2\cdot{k}\cdot\frac{v}{h})$	✔️	generates a linear layer based on context + decouples pos/context
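
Several rows above with $\mathcal{O}({N}\cdot{D}^2)$ complexity (e.g. Efficient Attention, Transformers are RNNs) rely on the same reassociation: instead of forming the N×N matrix phi(Q)phi(K)^T and multiplying by V, compute phi(K)^T V first. The following is a minimal NumPy sketch of that idea, not any listed repository's implementation. The function names, the choice of `elu(x) + 1` as the feature map, and the non-causal (no masking) setting are assumptions made for illustration.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive.
    # (Assumed feature map for this sketch; other positive maps work the same way.)
    return np.where(x > 0, x + 1.0, np.exp(x))


def quadratic_attention(q, k, v):
    # Reference: builds the full N x N similarity matrix, O(N^2 * D).
    scores = feature_map(q) @ feature_map(k).T            # (N, N)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)


def linear_attention(q, k, v):
    # Reassociated product phi(Q) (phi(K)^T V), O(N * D^2); no N x N matrix is formed.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                      # (D, D_v) summary of keys and values
    z = phi_k.sum(axis=0)                                 # (D,) normaliser terms
    return (phi_q @ kv) / (phi_q @ z)[:, None]


rng = np.random.default_rng(0)
N, D = 128, 16
q, k, v = (rng.standard_normal((N, D)) for _ in range(3))

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
max_abs_diff = float(np.abs(out_quadratic - out_linear).max())
```

Both functions compute the same output up to floating-point error; only the order of the matrix products differs. An autoregressive variant would replace the two sums over keys with running prefix sums, which this sketch does not cover.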
