rayleizhu / awsome-cuda-attention

Collection of efficient implementations of attention with CUDA


awsome-cuda-attention

A collection of efficient implementations of attention with CUDA.

Flash Attention

These two papers describe how to implement exact attention while avoiding the O(n^2) memory occupied by the intermediate attention matrix (i.e., QK^T).
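The core idea can be illustrated outside CUDA as well. The following NumPy sketch (an illustration of the tiled "online softmax" recurrence, not any of the referenced implementations) processes K and V in blocks while carrying a running row-wise max and softmax denominator, so the full n x n score matrix never materializes:

```python
import numpy as np

def flash_attention(Q, K, V, block_size=2):
    """Tiled exact attention: K/V are consumed block by block while running
    softmax statistics (row max m, denominator l) are updated, so only an
    n x block_size slice of scores exists at any time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of scores seen so far
    l = np.zeros(n)           # running row-wise softmax denominator
    for j in range(0, K.shape[0], block_size):
        Kj = K[j:j + block_size]
        Vj = V[j:j + block_size]
        S = (Q @ Kj.T) * scale                  # n x block_size tile of scores
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])          # numerically stable exponentials
        correction = np.exp(m - m_new)          # rescale previously accumulated terms
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

The result matches ordinary softmax attention exactly; only the order of accumulation differs.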

Implementations can be found at:

Block Sparse Attention

Sparse GEMMs are not GPU-friendly due to poor spatial and temporal locality, but structured block sparsity solves this problem. See OpenAI's blog for details.
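To make the idea concrete, here is a NumPy sketch (my own illustration, not OpenAI's kernel) of attention restricted by a block mask: pruned tiles of the score matrix are skipped entirely, while kept tiles are computed with dense, hardware-friendly GEMMs:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block_size):
    """Attention with a block-sparse score pattern. block_mask[i, j] == True
    enables the block_size x block_size tile (i, j) of the score matrix;
    False tiles are never computed or stored."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for bi in range(block_mask.shape[0]):
        qi = slice(bi * block_size, (bi + 1) * block_size)
        tiles, vals = [], []
        for bj in range(block_mask.shape[1]):
            if not block_mask[bi, bj]:
                continue                          # pruned tile: zero cost
            kj = slice(bj * block_size, (bj + 1) * block_size)
            tiles.append(Q[qi] @ K[kj].T * scale)  # dense GEMM on a kept tile
            vals.append(V[kj])
        S = np.concatenate(tiles, axis=1)          # scores over kept tiles only
        P = np.exp(S - S.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # softmax over kept keys
        O[qi] = P @ np.concatenate(vals, axis=0)
    return O
```

With an all-True mask this reduces to dense attention; sparser masks trade coverage for proportionally less compute and memory traffic.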

References
