mv3dpose

Off-the-shelf Multiple Person Multiple View 3D Pose Estimation.

Cite

If this repository is useful to you, please cite:

@inproceedings{tanke2019iterative,
  title={Iterative Greedy Matching for 3D Human Pose Tracking from Multiple Views},
  author={Tanke, Julian and Gall, Juergen},
  booktitle={German Conference on Pattern Recognition},
  year={2019}
}

Abstract

In this work we propose an approach for estimating 3D human poses of multiple people from a set of calibrated cameras. Estimating 3D human poses from multiple views has several compelling properties: human poses are estimated within a global coordinate space and multiple cameras provide an extended field of view which helps in resolving ambiguities, occlusions and motion blurs. Our approach builds upon a real-time 2D multi-person pose estimation system and greedily solves the association problem between multiple views. We utilize bipartite matching to track multiple people over multiple frames. This proves to be especially efficient as problems associated with greedy matching such as occlusion can be easily resolved in 3D. Our approach achieves state-of-the-art results on popular benchmarks and may serve as a baseline for future work.

Install

This project requires nvidia-docker and GPU drivers that support CUDA 10.

Clone this repository with its submodules as follows:

git clone --recursive https://github.com/jutanke/mv3dpose.git

Usage

Your dataset must reside in a pre-defined folder structure:

dataset
- dataset.json
- cameras
  - camera00
    - frame00xxxxxxm.json
  - camera01
    - frame00xxxxxxm.json
  - ...
  - camera_n
    - frame00xxxxxxm.json
- videos
  - camera00
    - frame00xxxxxxm.png
  - camera01
    - frame00xxxxxxm.png
  - ...
  - camera_n
    - frame00xxxxxxm.png

The file names per frame utilize the following schema:

"frame%09d.{png/json}"
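As a small illustration of the schema above, the following Python sketch builds a per-frame file name. The helper name `frame_file_name` is hypothetical and not part of this repository.

```python
def frame_file_name(frame: int, extension: str) -> str:
    """Build a per-frame file name following the "frame%09d" schema.

    Hypothetical helper for illustration; not part of mv3dpose.
    """
    if extension not in ("png", "json"):
        raise ValueError("extension must be 'png' or 'json'")
    # %09d pads the frame number with zeros to nine digits.
    return "frame%09d.%s" % (frame, extension)


print(frame_file_name(42, "json"))  # frame000000042.json
```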

The camera JSON files come in one of two forms. The first is a simple camera with only the projection matrix, width and height:

{
  "P" : [ 3 x 4 ],
  "w" : int(width),
  "h" : int(height)
}

The second is a more complex camera with distortion coefficients, based on the OpenCV camera model:

{
  "K" : [ 3 x 3 ], /* intrinsic parameters */
  "rvec": [ 1 x 3 ], /* rotation vector */
  "tvec": [ 1 x 3 ], /* translation vector */
  "discCoef": [ 1 x 5 ], /* distortion coefficient */
  "w" : int(width),
  "h" : int(height)
}
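To make the role of the 3 x 4 matrix "P" concrete, here is a minimal sketch that tells the two camera forms apart and projects a 3D point with a simple camera. It assumes the standard pinhole convention (homogeneous projection followed by division by the third coordinate). The function names and the example matrix are made up for illustration; the distortion handling of the OpenCV-style camera is not covered.

```python
def is_simple_camera(camera: dict) -> bool:
    """Return True for the projection-matrix form, False for the K/rvec/tvec form."""
    if "P" in camera:
        return True
    if "K" in camera:
        return False
    raise ValueError("camera has neither 'P' nor 'K'")


def project(P, point3d):
    """Project a 3D point to pixel coordinates with a 3 x 4 projection matrix."""
    x, y, z = point3d
    # Multiply P with the homogeneous point (x, y, z, 1).
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w == 0:
        raise ValueError("point projects to infinity")
    return u / w, v / w


# Illustrative camera: focal length 1000 px, principal point (320, 240).
camera = {
    "P": [[1000, 0, 320, 0], [0, 1000, 240, 0], [0, 0, 1, 0]],
    "w": 640,
    "h": 480,
}
if is_simple_camera(camera):
    print(project(camera["P"], (100.0, -200.0, 2000.0)))  # (370.0, 140.0)
```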

The system expects one camera file per view for every frame. If your dataset uses fixed cameras, simply repeat the same camera file for all frames.
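For fixed cameras, the repetition can be scripted. The sketch below writes the same camera dictionary once per frame. It assumes camera folders are named "camera%02d" (inferred from camera00 and camera01 in the folder structure above) and that frames are numbered from 0; `write_fixed_camera` is a hypothetical helper, not part of this repository.

```python
import json
import os


def write_fixed_camera(dataset_dir, camera_index, camera, n_frames):
    """Write one identical camera JSON file per frame for a fixed camera.

    Assumes the folder name pattern "camera%02d" and frames 0..n_frames-1.
    """
    cam_dir = os.path.join(dataset_dir, "cameras", "camera%02d" % camera_index)
    os.makedirs(cam_dir, exist_ok=True)
    for frame in range(n_frames):
        path = os.path.join(cam_dir, "frame%09d.json" % frame)
        with open(path, "w") as f:
            json.dump(camera, f)
    return cam_dir
```

For example, `write_fixed_camera("/path/to/your/dataset", 0, camera, 100)` would create 100 copies of `camera` under `cameras/camera00`.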

The dataset.json file contains general information for the model:

{
  "n_cameras": int(#cameras), /* number of cameras */
  "scale_to_mm": 1, /* scales the calibration to mm */
}

The variable scale_to_mm is needed because the system operates in millimeters [mm], while calibrations may use other units. For example, when the calibration is in meters, scale_to_mm must be set to 1000.
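A minimal sketch of generating dataset.json for a given calibration unit follows. The unit table and the helper `make_dataset_config` are illustrative assumptions; only the keys n_cameras and scale_to_mm come from the description above.

```python
import json

# Factor that converts one calibration unit into millimeters.
UNIT_TO_MM = {"mm": 1, "cm": 10, "m": 1000}


def make_dataset_config(n_cameras, calibration_unit):
    """Build the content of dataset.json for the given calibration unit."""
    return {
        "n_cameras": n_cameras,
        "scale_to_mm": UNIT_TO_MM[calibration_unit],
    }


# A four-camera setup calibrated in meters.
print(json.dumps(make_dataset_config(4, "m"), indent=2))
```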

Optional parameters

valid_frames: if frames do not start at 0 and/or are not continuous, you can set a list of frames here
epi_threshold: epipolar line distance threshold in pixels
max_distance_between_tracks: maximal distance in [mm] between tracks so that they can be associated
min_track_length: drop any track which is shorter than min_track_length frames
last_seen_delay: allow up to last_seen_delay frames to be skipped when reconnecting a lost track
smoothing_sigma: sigma value for Gaussian smoothing of tracks
smoothing_interpolation_range: defines how far fill-ins should reach
do_smoothing: should smoothing be done at all? (Default is True)
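When frames are not continuous, the list for valid_frames can be derived from the files that actually exist. The sketch below scans one camera folder for names matching the "frame%09d.json" schema. It assumes valid_frames takes a list of integer frame numbers; `find_valid_frames` is a hypothetical helper, not part of this repository.

```python
import os
import re

# Matches file names such as frame000000042.json and captures the number.
FRAME_PATTERN = re.compile(r"^frame(\d{9})\.json$")


def find_valid_frames(camera_dir):
    """Return the sorted frame numbers found in one camera folder."""
    frames = []
    for name in os.listdir(camera_dir):
        match = FRAME_PATTERN.match(name)
        if match:
            frames.append(int(match.group(1)))
    return sorted(frames)
```

The returned list could then be stored under the valid_frames key of dataset.json.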

Run the system

./mvpose.sh /path/to/your/dataset

The resulting tracks are written to the tracks3d folder inside your dataset folder. Each track represents a single person. The files are organised as follows:

{
  "J": int(joint number), /* number of joints */
  "frames": [int, int], /* ordered list of the frames in which this track is present */
  "poses": [ n_frames x J x 3 ] /* 3D poses: 3D location, or None if the joint is missing */
}
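Since a joint may be missing, code that consumes a track has to check for None before using a 3D location. The following sketch counts the detected joints per frame for a track that has already been loaded (for example with `json.load`). The example track is made up, and `count_valid_joints` is a hypothetical helper, not part of this repository.

```python
def count_valid_joints(track):
    """Return, for each frame of the track, how many joints have a 3D location."""
    if len(track["frames"]) != len(track["poses"]):
        raise ValueError("frames and poses must have the same length")
    return [
        sum(1 for joint in pose if joint is not None)
        for pose in track["poses"]
    ]


# Made-up track with two joints over two frames; one joint is missing in frame 10.
example_track = {
    "J": 2,
    "frames": [10, 11],
    "poses": [
        [[0.0, 0.0, 0.0], None],
        [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    ],
}
print(count_valid_joints(example_track))  # [1, 2]
```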
