tonible14012002/til-blog

Introduction

Full-body human pose estimation is a fundamental task in computer vision, with many applications derived from it.

This survey focuses only on the top-down approach, which first detects human bounding boxes and then estimates the pose within each box.

Top-Down Framework

Limitation

Top-down approaches dominate common benchmarks.
The detection stage and the pose estimator are separate, so if detection fails, the pose estimator has no way to recover.

Solutions

Lower the detection confidence and NMS thresholds (this can produce redundant boxes).
Eliminate the redundant boxes with a parametric pose NMS (using a novel pose distance metric to compare pose similarity).
Optimize with a data-driven approach.
Speed up the top-down framework with the multi-stage concurrent pipeline in AlphaPose, which enables it to run in real time.

AlphaPose Improvement

AlphaPose uses a novel symmetric integral keypoint regression method that localizes keypoints accurately at different scales.
It extends the pose-guided proposal generator to work with Multi-Domain Knowledge Distillation, which incorporates training data from separate body-part datasets.
A novel part-guided human proposal generator (PGPG) augments the training samples.
A new whole-body pose estimation benchmark is annotated (136 points for each person).
A pose-aware identity embedding enables simultaneous human pose tracking: a person re-ID branch is attached to the pose estimator.

This design allows real-time pose estimation and tracking in a unified manner.

Whole-Body Keypoint Localization

Limitation of many Previous Methods

Goal: unified detection of body, face, hand, and foot keypoints for multiple people.

OpenPose - detects body keypoints using PAFs, then estimates face landmarks and hand keypoints using two separate networks -> this design consumes extra computation resources and is time-inefficient.
Hidalgo et al. - use a single network for whole-body estimation -> the output resolution is limited, which decreases performance on fine-level keypoints (faces, hands).
ZoomNet - uses ROIAlign to crop the hand and face regions on the feature maps, then predicts keypoints on the resized regions.

All of the methods above use heatmaps to represent keypoints.

Heatmap Limitation

One of the major problems is quantization error.
Heatmaps are unsuitable for localizing keypoints of the body, face, and hands simultaneously.
Drawback - although heatmaps are the most widely used keypoint representation, the heatmap is usually only a quarter of the size of the input image.
The heatmap representation is discrete, so it may miss the correct position (not a problem for body-only estimation, but harmful for fine-level keypoints on the hands and face).
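The quantization error described above can be illustrated with a small sketch. It assumes a heatmap at a quarter of the input resolution (stride 4) and an idealized heatmap whose peak falls on the cell nearest the true keypoint; the function name and the pixel values are hypothetical and only serve the illustration.

```python
import math

def quantization_error(x_true, stride=4):
    """Error (in input pixels) after snapping a coordinate to a heatmap cell.

    Assumes an ideal heatmap whose argmax is the cell nearest to x_true.
    """
    cell = math.floor(x_true / stride + 0.5)  # nearest heatmap cell index
    x_decoded = cell * stride                 # map the cell back to input pixels
    return abs(x_true - x_decoded)

# The worst case over a range of positions is half the stride.
worst = max(quantization_error(x) for x in range(200))
print(worst)
```

The error is bounded by half the stride regardless of how large the body part is. A 2-pixel offset is small relative to a torso but large relative to a fingertip, which is why the same heatmap resolution hurts hands and faces more than the body.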

The current solutions are to use additional sub-networks for hand and face estimation, or to use ROI-Align to enlarge the feature map. However, both methods are computationally expensive (especially for full-body estimation).

Adopt Soft-Argmax For keypoints Representation

Many works have shown that soft-argmax based integral regression is more suitable for whole-body keypoints. However, studies also show some drawbacks of using soft-argmax.

Issues

+ Asymmetric gradient Problem
+ Size-dependent Keypoint Scoring Problem

__AlphaPose solves these problems and provides a new regression method with higher accuracy__
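For reference, plain soft-argmax integral regression can be sketched in one dimension as below. This is the standard formulation (softmax over the heatmap, then the expected coordinate), not AlphaPose's symmetric variant; the function name and the example logits are assumptions for illustration.

```python
import math

def soft_argmax_1d(logits):
    """Expected coordinate under a softmax over 1-D heatmap logits."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Integral regression: probability-weighted sum of cell indices.
    return sum(i * p for i, p in enumerate(probs))

# A peak that sits between cells 2 and 3.
logits = [0.0, 1.0, 3.0, 3.0, 1.0, 0.0]
hard = logits.index(max(logits))   # hard argmax can only return an integer cell
soft = soft_argmax_1d(logits)      # soft-argmax returns a continuous coordinate
print(hard, soft)
```

Because the output is a weighted average rather than a cell index, the prediction is continuous and differentiable, which is what removes the quantization error of the heatmap argmax.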

Multi Person Pose Tracking

Pose tracking extends multi-person pose estimation to videos. It aims to link the same individual across multiple frames and to output a sequence of poses showing how each pose changes over time. A pose tracker is useful for action recognition tasks. Like pose estimation, it is categorized into top-down and bottom-up approaches.

Limitation Of Previous Methods

Build temporal and spatial graphs and solve an optimization problem -> the graph-cut optimization cannot run in an online manner -> time-consuming and memory-inefficient.
Utilize a 3D Mask R-CNN to estimate person tubes and poses simultaneously -> takes a whole video sequence as input -> online tracking is not possible.
Use a forward and backward bounding box propagation strategy to eliminate the issue of missed detections -> same problem as above.

These methods rely on the spatial continuity of poses only.

Adopt Re-ID feature

A pose-guided re-ID feature extraction is designed to avoid potential background noise.
A multi-stage information merging method utilizes boxes, poses, and re-ID features simultaneously.

Notes

Confidence Threshold
Localizing Keypoints: the process of locating keypoints (determining the position of each keypoint on a human body).
NMS - Non-Maximum Suppression: eliminates redundant bounding boxes.
- Discard all boxes with confidence below some threshold.
- While any boxes remain, pick the box with the highest confidence and discard every remaining box whose IoU with the picked box is > 0.5 (high overlap).
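The two NMS steps above can be sketched as follows. This is a minimal greedy version, assuming boxes are given as `(x1, y1, x2, y2)` tuples; the 0.5 IoU threshold comes from the notes, while the 0.3 confidence threshold and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, score_thresh=0.3, iou_thresh=0.5):
    """Return the indices of the boxes that survive greedy NMS."""
    # Step 1: discard low-confidence boxes, then sort by descending confidence.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    # Step 2: repeatedly keep the best box and drop boxes overlapping it too much.
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # -> [0, 2]
```

In the example, box 1 overlaps box 0 with an IoU of about 0.68 and is suppressed, and box 3 is dropped by the confidence threshold before the loop starts.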
ROI-Pooling - aligns and pools the features from regions of interest into a fixed-size feature map.
- Input: an image's feature map together with its ROIs (bounding boxes).
- ROI-Pooling divides each ROI into a fixed grid (2x2, 3x3, ...).
- ROI-Pooling quantizes each location to the nearest feature map cell and pools the feature values in that cell.
- Output: a fixed-size representation for each ROI.
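The ROI-Pooling steps above can be sketched as below. It assumes a single-channel feature map stored as a list of rows, an ROI `(x1, y1, x2, y2)` already expressed in feature-map coordinates and lying inside the map, and max pooling inside each bin; the function name is hypothetical.

```python
import math

def roi_max_pool(feature, roi, out_size=2):
    """Max-pool one ROI into an out_size x out_size grid."""
    # Quantize the ROI corners to the nearest feature-map cell boundary.
    x1, y1, x2, y2 = (int(math.floor(v + 0.5)) for v in roi)
    h, w = max(y2 - y1, 1), max(x2 - x1, 1)
    pooled = []
    for i in range(out_size):
        # Integer row range of bin i (at least one row).
        r0 = y1 + (i * h) // out_size
        r1 = max(y1 + math.ceil((i + 1) * h / out_size), r0 + 1)
        row = []
        for j in range(out_size):
            # Integer column range of bin j (at least one column).
            c0 = x1 + (j * w) // out_size
            c1 = max(x1 + math.ceil((j + 1) * w / out_size), c0 + 1)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

feature = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
print(roi_max_pool(feature, (0.2, 0.4, 3.6, 3.7)))  # -> [[5, 7], [13, 15]]
```

The rounding on the first line of the function is where spatial precision is lost: the fractional ROI `(0.2, 0.4, 3.6, 3.7)` is treated exactly like `(0, 0, 4, 4)`.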
ROI-Align - Region of Interest Align, an improvement over the ROI-Pooling method. Instead of quantizing the ROI locations (the third step of ROI-Pooling), ROI-Align uses bilinear interpolation to sample the feature map at sub-pixel locations within the ROI, which results in a more accurate alignment between the ROI and the feature map. It then typically pools or aggregates the interpolated values to obtain a fixed-size representation for each ROI.
- Improved spatial information.
- More accurate localization of object boundaries, which suits object detection tasks where precise object localization is crucial.
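A simplified sketch of the ROI-Align idea follows. It assumes a single-channel feature map stored as a list of rows, treats `feature[r][c]` as the value at the point `(x=c, y=r)`, and takes one bilinear sample at the centre of each bin; real implementations usually average several samples per bin, and the function names here are hypothetical.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate the feature map at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)   # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, out_size=2):
    """Sample each bin centre of the ROI without rounding the ROI coordinates."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [[bilinear(feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
             for j in range(out_size)]
            for i in range(out_size)]

feature = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
print(roi_align(feature, (0.5, 0.5, 2.5, 2.5)))
```

Since the ROI coordinates are never rounded, shifting the ROI by a fraction of a cell changes the sampled values smoothly, which is the property that matters for small regions such as hands and faces.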
