Introduction

Human pose estimation relies heavily on both visual cues and the constraint cues between parts to locate keypoints. Most existing CNN-based methods learn visual representations well but lack the ability to explicitly learn the constraint relationships between keypoints. In this paper, we propose a novel approach based on Token representation for human Pose estimation (TokenPose).

The contributions of this work are summarized as follows:

We propose to use tokens to represent each keypoint entity. In this way, visual cue learning and constraint cue learning are explicitly incorporated into a unified framework.
Both hybrid and pure Transformer-based architectures are explored in this work. To the best of our knowledge, our proposed TokenPose-T is the first pure Transformer-based model for 2D human pose estimation.
We conduct experiments on two widely used benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. TokenPose achieves performance competitive with the state of the art while requiring far fewer parameters and much less computation than existing CNN-based counterparts.
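
The token idea in the contributions above can be illustrated with a toy sketch. This is not the repository's implementation: the image size, patch size, random initialization, and the helper names `split_into_patches` and `self_attention` are all assumptions made for illustration, and the attention step uses identity projections instead of learned weights. It only shows how keypoint tokens and visual (patch) tokens can share one sequence, so that a single attention step lets each keypoint token look at image patches (visual cues) and at the other keypoint tokens (constraint cues).

```python
import math
import random


def split_into_patches(image, patch_h, patch_w):
    """Flatten a 2D grayscale image (list of rows) into one vector per patch."""
    height, width = len(image), len(image[0])
    assert height % patch_h == 0 and width % patch_w == 0
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_h) for dx in range(patch_w)]
            patches.append(patch)
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * tok[d] for a, tok in zip(attn, tokens))
                        for d in range(dim)])
    return outputs, weights


random.seed(0)
NUM_KEYPOINTS = 17        # COCO annotates 17 keypoints per person
HEIGHT, WIDTH = 8, 6      # toy image size (assumption, for illustration only)
PATCH_H, PATCH_W = 4, 3   # toy patch size (assumption, for illustration only)

image = [[random.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Visual tokens: one flattened vector per image patch.
visual_tokens = split_into_patches(image, PATCH_H, PATCH_W)
dim = PATCH_H * PATCH_W

# Keypoint tokens: one vector per keypoint; learnable in a real model,
# randomly initialized here.
keypoint_tokens = [[random.gauss(0.0, 0.02) for _ in range(dim)]
                   for _ in range(NUM_KEYPOINTS)]

# Both token types share one sequence, so attention mixes them.
sequence = keypoint_tokens + visual_tokens
outputs, weights = self_attention(sequence)

# The first NUM_KEYPOINTS outputs correspond to the keypoint tokens; a real
# model would feed these to a prediction head.
keypoint_outputs = outputs[:NUM_KEYPOINTS]
print(len(visual_tokens), len(sequence), len(keypoint_outputs))
```

Each row of `weights` has one entry per token in the sequence. For a keypoint token, the first 17 entries are its attention to the keypoint tokens and the remaining entries are its attention to the image patches.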

arXiv: 2104.03516

About

Implementation for: TokenPose: Learning Keypoint Tokens for Human Pose Estimation (https://arxiv.org/abs/2104.03516)
