Paper: Pyramid Self-Attention for Semantic Segmentation
by Jiyang Qi1, Xinggang Wang1,2 *, Yao Hu3, Xu Tang3, Wenyu Liu1
1 School of EIC, HUST, 2 Hubei Key Laboratory of Smart Internet Technology, 3 Alibaba Group
(*) Corresponding Author
Self-attention is vital in computer vision: it is the building block of the Transformer and can model long-range context for visual recognition. However, computing pairwise self-attention between all pixels for dense prediction tasks (e.g., semantic segmentation) incurs a high computational cost. In this work, we propose a novel pyramid self-attention (PySA) mechanism that collects global context information far more efficiently. Concretely, the basic module of PySA first divides the whole image into R x R regions, and then further divides every region into G x G grids. One feature is extracted for each grid, and self-attention is applied to the grid features within the same region. PySA progressively increases R (e.g., from 1 to 8) to harvest more local context information and to propagate global context to local regions in a parallel/series manner. Since G can be kept small, the computational complexity remains low.
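The basic module described in the abstract can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' implementation: it assumes average pooling for grid-feature extraction and identity Q/K/V projections (the paper does not specify these details here), and it operates on a feature map stored as nested Python lists of per-pixel feature vectors.

```python
import math

def avg_pool(feat, r0, r1, c0, c1):
    # Extract one feature for a grid by averaging its pixel features.
    C = len(feat[0][0])
    n = (r1 - r0) * (c1 - c0)
    out = [0.0] * C
    for i in range(r0, r1):
        for j in range(c0, c1):
            for c in range(C):
                out[c] += feat[i][j][c]
    return [v / n for v in out]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def pysa_module(feat, R, G):
    """One PySA basic module (toy sketch): split an H x W feature map
    into R x R regions, split each region into G x G grids, pool one
    feature per grid, then run scaled dot-product self-attention among
    the G*G grid features of the same region.  Only R*R * (G*G)^2
    attention scores are computed instead of (H*W)^2."""
    H, W = len(feat), len(feat[0])
    assert H % (R * G) == 0 and W % (R * G) == 0
    rh, rw = H // R, W // R      # region height / width
    gh, gw = rh // G, rw // G    # grid height / width
    out = []                     # per region: G*G attended grid features
    for ri in range(R):
        for rj in range(R):
            grids = []
            for gi in range(G):
                for gj in range(G):
                    r0 = ri * rh + gi * gh
                    c0 = rj * rw + gj * gw
                    grids.append(avg_pool(feat, r0, r0 + gh, c0, c0 + gw))
            C = len(grids[0])
            scale = 1.0 / math.sqrt(C)
            attended = []
            for q in grids:  # attention restricted to one region's grids
                scores = softmax([sum(qc * kc for qc, kc in zip(q, k)) * scale
                                  for k in grids])
                attended.append([sum(w * v[c] for w, v in zip(scores, grids))
                                 for c in range(C)])
            out.append(attended)
    return out
```

Running this module repeatedly with increasing R (e.g., R = 1, 2, 4, 8) yields the pyramid: small R captures coarse global context, while larger R refines it within ever more local regions.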