This repository accompanies the SC '21 paper *E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs*. It contains implementations of several kernels described in the paper, along with a few example encoders.
Tested on an NVIDIA V100S GPU with CUDA 11.4.
There are three encoder examples under `test`, all of which use randomly generated data:

- On-the-fly attention with tensor-tile pruned linear transformations (`encoder_tile_test`); a minimal sketch of the tile-pruning idea follows this list
- Attention-aware pruning with pruned self-attention (`encoder_prune_test`)
- Sequence-aware optimized encoder (`encoder_length_test`)
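For intuition, here is a minimal, hypothetical CUDA sketch of the tensor-tile pruning idea behind `encoder_tile_test`: the weight matrix is stored only as the dense tiles that survive pruning, and the kernel multiplies those tiles while skipping pruned regions entirely. The names, tile size, and storage layout (`tilePrunedGemv`, 16x16 tiles, per-tile block-row/block-col indices) are assumptions for illustration, not the repository's actual kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 16;

// y += W * x, where W is represented only by its non-pruned TILE x TILE blocks.
__global__ void tilePrunedGemv(const float* __restrict__ tiles,   // [nnzTiles][TILE][TILE], surviving tiles
                               const int*   __restrict__ tileRow, // block-row index of each surviving tile
                               const int*   __restrict__ tileCol, // block-col index of each surviving tile
                               const float* __restrict__ x,
                               float*       __restrict__ y,
                               int nnzTiles) {
    int t = blockIdx.x;   // one thread block per surviving tile
    int r = threadIdx.x;  // one thread per row within the tile
    if (t >= nnzTiles || r >= TILE) return;

    const float* w  = tiles + (size_t)t * TILE * TILE;
    const float* xs = x + tileCol[t] * TILE;   // input slice this tile reads

    float acc = 0.0f;
    for (int c = 0; c < TILE; ++c)
        acc += w[r * TILE + c] * xs[c];

    // Tiles sharing a block-row accumulate into the same output slice.
    atomicAdd(&y[tileRow[t] * TILE + r], acc);
}

int main() {
    // Toy case: a 32x32 weight matrix (a 2x2 grid of tiles) where only the
    // two diagonal tiles survive pruning; the off-diagonal tiles are skipped.
    const int nnz = 2, n = 2 * TILE;
    float hTiles[nnz * TILE * TILE], hx[n], hy[n] = {0};
    int hRow[nnz] = {0, 1}, hCol[nnz] = {0, 1};
    for (int i = 0; i < nnz * TILE * TILE; ++i) hTiles[i] = 0.01f;
    for (int i = 0; i < n; ++i) hx[i] = 1.0f;

    float *dTiles, *dx, *dy;
    int *dRow, *dCol;
    cudaMalloc(&dTiles, sizeof(hTiles));
    cudaMalloc(&dRow, sizeof(hRow));
    cudaMalloc(&dCol, sizeof(hCol));
    cudaMalloc(&dx, sizeof(hx));
    cudaMalloc(&dy, sizeof(hy));
    cudaMemcpy(dTiles, hTiles, sizeof(hTiles), cudaMemcpyHostToDevice);
    cudaMemcpy(dRow, hRow, sizeof(hRow), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, hCol, sizeof(hCol), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, sizeof(hy), cudaMemcpyHostToDevice);

    tilePrunedGemv<<<nnz, TILE>>>(dTiles, dRow, dCol, dx, dy, nnz);
    cudaMemcpy(hy, dy, sizeof(hy), cudaMemcpyDeviceToHost);

    // Each output element should be 0.01 * 16 = 0.16.
    printf("y[0] = %f, y[%d] = %f\n", hy[0], n - 1, hy[n - 1]);
    return 0;
}
```

The payoff of this storage scheme is that work scales with the number of surviving tiles rather than the full matrix size, while each tile remains dense enough to use the GPU efficiently.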
```bash
mkdir build && cd build
cmake ..
make -j
```
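After a successful build, the example binaries can be run directly from the build directory, e.g. `./encoder_tile_test` (assuming the executables are named after their CMake targets; check the generated build tree if the names differ). No input files are needed, since each example generates its own random data.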