Enhancing Egocentric 3D Pose Estimation with Third Person Views

Abstract

We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera. The main technical contribution consists of leveraging high-level features linking first- and third-views in a joint embedding space. To learn such embedding space we introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives. We explicitly consider spatial- and motion-domain features, combined using a semi-Siamese architecture trained in a self-supervised fashion. Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos, with no need to perform any sort of domain adaptation or knowledge of camera parameters. An extensive evaluation demonstrates that we achieve significant improvement in egocentric 3D body pose estimation performance on two unconstrained datasets, over three supervised state-of-the-art approaches. Our dataset and code will be available for research purposes.


Pattern Recognition Journal Paper link

[Code] Coming SOON!

Data

Trained Siamese Model

Approach Overview

Our model uses a semi-Siamese architecture that learns to detect whether a pair of first- and third-view videos from the First2Third paired source dataset is synchronized, by minimizing a contrastive loss (green arrows). Each stream of the semi-Siamese network takes stacked RGB and optical flow frames as input.
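To make the synchronization pretext task concrete, here is a minimal sketch of a margin-based contrastive loss on one pair of embeddings. It assumes the common formulation (squared distance for synchronized pairs, squared hinge on the margin for unsynchronized ones); the exact loss, the distance, and the `margin` value used in the paper are not stated here, and the function names are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(z_first, z_third, synchronized, margin=1.0):
    """Margin-based contrastive loss for one first-/third-view pair.

    Assumption: standard pairwise form, not necessarily the paper's exact loss.
    Synchronized pairs are pulled together; unsynchronized pairs are pushed
    apart until they are at least `margin` away from each other.
    """
    d = euclidean_distance(z_first, z_third)
    if synchronized:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-D embeddings (the real embeddings are higher dimensional).
loss_sync = contrastive_loss([0.0, 0.0], [3.0, 4.0], synchronized=True)
loss_unsync = contrastive_loss([0.0, 0.0], [3.0, 4.0], synchronized=False)
```

With this form, an unsynchronized pair that is already farther apart than the margin contributes no loss, so training effort concentrates on pairs the embedding still confuses.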

This pretext task yields a joint embedding space in which the gap between the first-view and third-view worlds is minimized. The learned joint embedding space can in principle be leveraged by any supervised method for 3D egopose estimation on a target dataset, with no need for domain adaptation. At both train time (brown arrows) and test time (blue arrows), the semi-Siamese network is used to project features onto the learned joint embedding space. z is a 64-dimensional vector, obtained by removing the softmax layer of the Siamese network pre-trained with our dataset.
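The idea of reading out z by dropping the softmax stage can be sketched with a toy stand-in for one stream of the network. Everything below is hypothetical: `ToyStream`, its random weights, the 8-dimensional input, and the assumption that the 64-dimensional activation feeds a small classification head followed by softmax. Only the embedding size of 64 comes from the text above.

```python
import math
import random

EMBED_DIM = 64  # size of z stated in the text


def make_linear(n_in, n_out, rng):
    """Random weights and zero biases for a toy fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a fully connected layer to the input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyStream:
    """Hypothetical stand-in for one pre-trained stream (not the real model)."""

    def __init__(self, n_in, n_classes=2, seed=0):
        rng = random.Random(seed)
        self.body = make_linear(n_in, EMBED_DIM, rng)
        self.head = make_linear(EMBED_DIM, n_classes, rng)

    def embed(self, x):
        """Return z: the ReLU activations taken before the head and softmax."""
        return [max(0.0, v) for v in linear(x, self.body)]

    def classify(self, x):
        """Full forward pass, including the softmax stage that embed() skips."""
        return softmax(linear(self.embed(x), self.head))


stream = ToyStream(n_in=8)
clip_features = [0.5] * 8  # placeholder for features of one egocentric clip
z = stream.embed(clip_features)  # 64-dimensional vector for a downstream pose estimator
```

In this sketch the stream's weights stay fixed after pre-training, and a downstream 3D pose estimator would consume `z` at both train and test time; how the paper feeds z into each supervised method is not specified here.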

Dataset Examples

Enric lab egoview	Enric lab frontview	Enric lab sideview	Enric lab topview

Shahrukh outdoor egoview	Shahrukh outdoor frontview	Shahrukh outdoor sideview

Authors

Ameya Dhamanaskar, Mariella Dimiccoli, Enric Corona, Albert Pumarola, Francesc Moreno-Noguer.

Reference

If you use First2Third-Pose in your research or wish to refer to the baseline results published in the paper, please use the following BibTeX entry.

@article{DHAMANASKAR2023109358,
title = {Enhancing egocentric 3D pose estimation with third person views},
journal = {Pattern Recognition},
volume = {138},
pages = {109358},
year = {2023},
issn = {0031-3203},
doi = {https://doi.org/10.1016/j.patcog.2023.109358},
url = {https://www.sciencedirect.com/science/article/pii/S0031320323000596},
author = {Ameya Dhamanaskar and Mariella Dimiccoli and Enric Corona and Albert Pumarola and Francesc Moreno-Noguer},
keywords = {3D pose estimation, Self-supervised learning, Egocentric vision}
}
