- CycleMLP: A MLP-like Architecture for Dense Prediction
- AS-MLP: An Axial Shifted MLP Architecture for Vision
- Global Filter Networks for Image Classification
- Rethinking Token-Mixing MLP for MLP-based Vision Backbone
- Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
- S^2-MLP: Spatial-Shift MLP Architecture for Vision
- Graph-MLP: Node Classification without Message Passing in Graph
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
- Container: Context Aggregation Network
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
- Can Attention Enable MLPs To Catch Up With CNNs?
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification
- Less is More: Pay Less Attention in Vision Transformers
- MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation
- A remark on a paper of Krotov and Hopfield
- Pay Attention to MLPs
- FNet: Mixing Tokens with Fourier Transforms
- ResMLP: Feedforward networks for image classification with data-efficient training
- Are Pre-trained Convolutions Better than Pre-trained Transformers?
- Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks
- RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
- MLP-Mixer: An all-MLP Architecture for Vision
- Synthesizer: Rethinking Self-Attention for Transformer Models
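
Most entries above revolve around one shared template, introduced by MLP-Mixer: alternate a token-mixing MLP (applied across patches) with a channel-mixing MLP (applied across features), each wrapped in LayerNorm and a residual connection. Below is a minimal NumPy sketch of that block; the shapes, weight names, and the tanh GELU approximation are illustrative assumptions for a single unbatched example, not the authors' code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (channel) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # Two-layer MLP with a tanh-approximated GELU, acting on the last axis.
    h = x @ w1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # x: (num_patches, channels).
    # Token mixing: transpose so the patch axis is last, mix, transpose back.
    y = x + mlp(layer_norm(x).T, tok_w1, tok_w2).T
    # Channel mixing: MLP over the channel axis.
    return y + mlp(layer_norm(y), ch_w1, ch_w2)

# Toy shapes (hypothetical): 16 patches, 32 channels, hidden width 64.
rng = np.random.default_rng(0)
P, C, H = 16, 32, 64
x = rng.standard_normal((P, C))
out = mixer_block(
    x,
    rng.standard_normal((P, H)) * 0.02, rng.standard_normal((H, P)) * 0.02,
    rng.standard_normal((C, H)) * 0.02, rng.standard_normal((H, C)) * 0.02,
)
print(out.shape)  # (16, 32)
```

Many of the listed variants keep the channel-mixing half and differ mainly in how tokens are mixed: spatial shifts (S^2-MLP, AS-MLP), cyclic sampling (CycleMLP), per-dimension permutations (Vision Permutator), or Fourier transforms in place of the token MLP (FNet, Global Filter Networks).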