Fujiry0 / Trajectory-Prediction-Survery

Survey of Trajectory Prediction

Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision

Paper: https://arxiv.org/pdf/1911.01138.pdf

Summary: Proposes a method for completing and forecasting human poses captured by an egocentric camera.

Comment: The overall idea comes across, but the explanation is not detailed enough to replicate the exact algorithm. The figures and notation could be improved considerably to make the paper easier to understand.

  • Does not consider interactions between objects.

Details

  • Method: A pipeline of multiple DNN modules

  • Input: Incomplete human poses from the past t_p frames

  • Output: Completed human poses for the past t_p frames and forecast poses for the future t frames

  • Missing joint completion: Train an autoencoder-like model that takes multiple frames of incomplete human poses and denoises/completes the missing joints (trained with dropout-style corruption; joints with low confidence are presumably set to 0; the paper does not explain how the joint coordinates are normalized). A minimal completion sketch follows this list.

  • Disentangle local and global joint locations: Separate a sequence of poses into

    • global motion: movement of the person through the image frame
    • local motion: movement of the joints with respect to the neck joint
    • In the final prediction, the two streams are summed to reconstruct the complete pose (see the decomposition sketch below this list)
  • Prediction: Feed the previously completed poses, the depth of each joint, and the egomotion transformation matrix into a Quasi-RNN (QRNN)

  • Other sub-modules

    • OpenPose for pose detection (incomplete pose)
    • SuperDepth for monocular depth
    • [Zhou et al. 2017] for egomotion estimation
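
A minimal sketch of the completion step, assuming a simple MLP autoencoder over a flattened window of 2D poses with confidence-based zeroing of unreliable joints; the window length, joint count, layer sizes, and confidence threshold below are illustrative assumptions, not the paper's values:

```python
import torch
import torch.nn as nn

class PoseCompletionAE(nn.Module):
    """Autoencoder-like joint completion (illustrative sketch, not the paper's exact architecture)."""

    def __init__(self, num_frames=10, num_joints=25, hidden=256):
        super().__init__()
        dim = num_frames * num_joints * 2  # flattened window of 2D joints
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden // 2), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden // 2, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

    def forward(self, poses, confidence, conf_thresh=0.1):
        # poses: (B, T, J, 2) pixel coordinates; confidence: (B, T, J), e.g. from OpenPose
        mask = (confidence > conf_thresh).float().unsqueeze(-1)
        corrupted = poses * mask  # zero out low-confidence / missing joints (dropout-like corruption)
        completed = self.decoder(self.encoder(corrupted.flatten(1)))
        return completed.view_as(poses)


# usage: reconstruct a full pose window from a partially observed one
model = PoseCompletionAE()
poses = torch.randn(4, 10, 25, 2)      # batch of 4 windows, 10 frames, 25 joints
confidence = torch.rand(4, 10, 25)     # detector confidence per joint
completed = model(poses, confidence)   # (4, 10, 25, 2)
```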
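
And a sketch of the global/local decomposition described above, assuming the neck joint is the global reference (index 1 here is an assumed OpenPose-style keypoint ordering); recomposition is just the sum of the two streams:

```python
import torch

NECK = 1  # assumed neck-joint index (OpenPose-style keypoint layout)

def disentangle(poses):
    """Split a pose sequence (T, J, 2) into a global and a local stream."""
    global_motion = poses[:, NECK:NECK + 1, :]   # (T, 1, 2): neck trajectory through the frame
    local_motion = poses - global_motion         # (T, J, 2): joints relative to the neck
    return global_motion, local_motion

def recompose(global_motion, local_motion):
    """Sum the two streams back into complete image-space poses."""
    return local_motion + global_motion

# round trip: recompose(*disentangle(p)) recovers the original poses exactly
p = torch.randn(10, 25, 2)
g, l = disentangle(p)
assert torch.allclose(recompose(g, l), p)
```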

Figure
[architecture overview figure from the paper]
In the above figure, the paper simply feeds depth, egomotion, and pose into the network, i.e., the different frames are not transformed into a common 3D reference coordinate system. For depth, they take the depth estimate at each joint location and input those values into the network (not the whole depth image); a sketch of this input assembly follows.
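
A hedged sketch of that input assembly, assuming per-frame depth maps, 4x4 egomotion matrices, and pixel-coordinate joints (the tensor shapes are assumptions for illustration, not taken from the paper):

```python
import torch

def build_predictor_input(poses, depth_maps, egomotion):
    """Assemble one feature vector per frame for the sequence predictor (sketch).

    poses:      (T, J, 2) completed joint pixel coordinates
    depth_maps: (T, H, W) monocular depth, e.g. from SuperDepth
    egomotion:  (T, 4, 4) camera transformation matrices, e.g. from [Zhou et al. 2017]
    """
    T, J, _ = poses.shape
    H, W = depth_maps.shape[1:]
    # sample the depth estimate at each joint location (not the whole depth image)
    xs = poses[..., 0].long().clamp(0, W - 1)
    ys = poses[..., 1].long().clamp(0, H - 1)
    frame_idx = torch.arange(T).unsqueeze(1).expand(T, J)
    joint_depth = depth_maps[frame_idx, ys, xs]          # (T, J)
    # concatenate pose, per-joint depth, and flattened egomotion per frame
    return torch.cat([poses.flatten(1),                  # (T, J * 2)
                      joint_depth,                       # (T, J)
                      egomotion.flatten(1)], dim=1)      # (T, 16)
```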

Project Page: https://karttikeya.github.io/publication/plf/ (no code)