Visualizing the learned space-time attention

This repository contains an implementation of Attention Rollout for the TimeSformer model.

Attention Rollout was introduced in the paper Quantifying Attention Flow in Transformers. It uses attention weights to explain how a self-attention network works, and it indicates which parts of the input matter most when the output is generated.

It assumes that the attention weights determine the proportion of incoming information that can propagate through the layers, so the attention weights can be used to approximate how information flows between layers. If A is the 2-D attention weight matrix at layer l, A[i,j] represents the attention of token i at layer l to token j at layer l-1. To compute the attention of a token at layer l to the input tokens from the video, we recursively multiply the attention weight matrices, starting from the input layer and going up to layer l.
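The recursive multiplication described above can be sketched in NumPy as follows. This is a minimal illustration, not the code from this repository: the function name `attention_rollout` and the `add_residual` flag are hypothetical, and each layer's attention is assumed to be already averaged over heads. When `add_residual` is set, the identity matrix is mixed in and the rows are renormalized, which is how the Attention Rollout paper accounts for residual connections.

```python
import numpy as np

def attention_rollout(attentions, add_residual=True):
    """Roll out attention from the input up to the last layer given.

    attentions: list of (n_tokens, n_tokens) arrays ordered from the first
    layer to the last, where A[i, j] is the attention of token i at that
    layer to token j at the layer below (rows sum to 1).
    """
    n_tokens = attentions[0].shape[0]
    rollout = np.eye(n_tokens)
    for A in attentions:
        if add_residual:
            # Assumption taken from the paper: model the skip connection
            # by averaging with the identity, then renormalize each row.
            A = 0.5 * A + 0.5 * np.eye(n_tokens)
            A = A / A.sum(axis=-1, keepdims=True)
        # Attention of layer-l tokens to the input tokens.
        rollout = A @ rollout
    return rollout
```

Row i of the result approximates how much each input token contributes to token i at the last layer in the list.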

Implementing Attention Rollout for TimeSformer

For divided space-time attention, each token has two dimensions. We denote a token as z(p,t), where p is the spatial dimension and t is the time dimension.

Each encoding block contains a time attention layer and a space attention layer. During time attention, each patch token only attends to patches at the same spatial location; during space attention, each patch only attends to patches from the same frame. If we use T and S to denote the time attention weights and the space attention weights respectively, T[p,j,q] represents the attention of z(p,j) to z(p,q) from the previous layer during the time attention layer, and S[i,j,p] represents the space attention of z(i,j) to z(p,j) from the time attention layer.

When we combine the space and time attention, each patch token attends to the patches at every spatial location in all frames through a unique path (the cls_token is an exception, discussed later). The attention path from z(i,j) to z(p,q) (where p != 0) is

space attention: z(i,j)-> z(p,j)
time attention: z(p,j)-> z(p,q)

We can then calculate the combined space-time attention W as

W[i,j,p,q] = S[i,j,p] * T[p,j,q]
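As a rough sketch of this formula, the product can be written as a single `einsum`. The function name `combine_space_time` and the array layout are assumptions made for illustration: S has shape (P, F, P) and T has shape (P, F, F), where P is the number of patches per frame and F the number of frames. The classification token is left out here.

```python
import numpy as np

def combine_space_time(S, T):
    """Combine divided space and time attention for patch tokens only.

    S: (P, F, P) array, S[i, j, p] = space attention of z(i,j) to z(p,j).
    T: (P, F, F) array, T[p, j, q] = time attention of z(p,j) to z(p,q).
    Returns W with W[i, j, p, q] = S[i, j, p] * T[p, j, q].
    """
    return np.einsum("ijp,pjq->ijpq", S, T)
```

If the rows of S and T each sum to 1, then W[i,j] also sums to 1 over all (p,q). Reshaping W to (P*F, P*F) gives a 2-D attention matrix over all patch tokens.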

Note that the classification token does not participate in the time attention layer: it is removed from the input before the time attention layer and added back before the space attention layer. This means it only attends to itself during the time attention computation, and we use an identity matrix to account for this. Because the classification token skips time attention, every token can only attend to the classification token from its own frame. To address this limitation, the TimeSformer implementation averages the cls_token output across all frames at the end of each space-time attention block, so that it carries information from the other frames. When we compute the combined space-time attention, we therefore also need to average the classification token's attention to all input tokens across frames.
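The classification-token handling described above might look like the following sketch. It is an illustration under stated assumptions, not the repository's code: the name `combine_with_cls` is hypothetical, index 0 along the spatial axis is assumed to be the cls_token (one copy per frame), S is assumed to have shape (P+1, F, P+1), and the patch-only time attention is assumed to have shape (P, F, F).

```python
import numpy as np

def combine_with_cls(S, T_patches):
    """Combine space and time attention when index 0 is the cls_token.

    S: (P+1, F, P+1) space attention, S[i, j, p] of token i to token p
       within frame j (index 0 is the per-frame copy of the cls_token).
    T_patches: (P, F, F) time attention of the patch tokens only.
    Returns W of shape (P+1, F, P+1, F).
    """
    n_frames = S.shape[1]
    # The cls_token skips time attention, so it only attends to itself:
    # prepend an identity matrix as its time attention.
    T = np.concatenate([np.eye(n_frames)[None], T_patches], axis=0)
    W = np.einsum("ijp,pjq->ijpq", S, T)
    # The cls_token output is averaged across frames, so average its
    # attention to all input tokens across frames as well.
    W[0] = W[0].mean(axis=0, keepdims=True)
    return W
```

With this layout, a patch token's attention to the cls_token is nonzero only for its own frame, and the cls_token rows are identical for every frame after averaging.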

Usage

Here is a notebook that demonstrates how to use Attention Rollout to visualize the space-time attention learned by TimeSformer:

A Colab notebook: Visualizing learned space time attention with attention rollout