vchoutas / smplify-x

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Home Page: https://smpl-x.is.tue.mpg.de/

estimating camera and shape only once for the whole sequence?

matejhof opened this issue · comments

The way we are running SMPLify, it re-estimates shape and camera position for every frame. In our datasets, we know that it was the same person (same shape) and that the camera position was fixed. Could we estimate the shape and the camera from the whole session (e.g., a 5-minute video), then fix them and run through the video again to estimate only the poses?

From what I see, the camera optimisation is here, lines 313 to 368:
https://github.com/vchoutas/smplify-x/blob/master/smplifyx/fit_single_frame.py
This function is called directly by the main, line 245, in the loop that processes frames individually.

To avoid re-estimation every frame, I guess the easiest way is to run this only on the first frame, return the camera settings, and then use these settings for subsequent loops. There's also some camera reset / initialisation to take care of in a similar way, lines 266 to 303.
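The first-frame idea above can be sketched roughly as follows. This is only a minimal illustration, not code from the smplify-x repository: `fit_sequence`, `fit_frame` and `toy_fit_frame` are hypothetical names, and the assumption is that the per-frame fitting function can be refactored to accept an already estimated camera and skip its own camera optimisation when one is given.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

# Hypothetical per-frame fitter (an assumption, not the smplify-x API):
# it receives a frame and an optional camera. If the camera is None it
# estimates one; otherwise it reuses the given camera unchanged.
FitFrame = Callable[[Any, Optional[Any]], Tuple[Any, Any]]


def fit_sequence(frames: Iterable[Any],
                 fit_frame: FitFrame) -> Tuple[Optional[Any], List[Any]]:
    """Estimate the camera on the first frame only, then reuse it."""
    camera = None
    poses = []
    for frame in frames:
        estimated_camera, pose = fit_frame(frame, camera)
        if camera is None:
            # Keep the first estimate and pass it to all later frames.
            camera = estimated_camera
        poses.append(pose)
    return camera, poses


def toy_fit_frame(frame, camera):
    """Stand-in fitter that records on which frame the camera was estimated."""
    if camera is None:
        camera = {"estimated_on": frame}
    return camera, f"pose_{frame}"


camera, poses = fit_sequence(["f0", "f1", "f2"], toy_fit_frame)
```

With the toy fitter, the camera is estimated once on `f0` and every later call receives that same object, which is the behaviour one would want from the real fitting loop.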

As for shape, my guess is that it is estimated inside the functions called between lines 389 and 439, which optimise the body model.

However, this has the drawback that if the first frame's estimate isn't that good, it might screw up all the following frames.

Your suggestion to take estimations from the whole session comes with significant extra processing cost, as well as a decision problem about which settings to use:

  • Have a metric that depends on the camera settings? How would this metric be calculated, and which parameters should it take into account? Maybe the distance between the camera and the person: select the mean over all frames, or, if the real distance is known, select the estimate closest to it? Either way, it does not guarantee the accuracy of the body model.
  • Maybe another metric, such as the volume of the 3D model, would be more appropriate, but the volume of a human shape is hard to calculate.
  • Re-project the 3D model onto the 2D image, and select the settings of the frame that has the highest overlap? This seems to be a metric that is used by a similar tool during the optimisation process (https://github.com/nkolot/SPIN), but it will most likely increase the processing time significantly. And as far as I know, detecting shapes such as the human body in an image is not trivial, and might be prone to errors as well.
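Two of the selection strategies listed above can be sketched in a few lines. This is a rough illustration under stated assumptions, not part of smplify-x or SPIN: `median_shape`, `mask_iou` and `best_frame` are hypothetical helpers, the per-frame shape coefficients are assumed to be plain lists of floats, and both the rendered model silhouette and the detected person silhouette are assumed to be available as sets of (row, column) pixels. The median is used here in place of the mean because it is less sensitive to a few badly fitted frames.

```python
from statistics import median
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) of a foreground pixel


def median_shape(per_frame_betas: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise median of the shape coefficients over all frames."""
    return [median(component) for component in zip(*per_frame_betas)]


def mask_iou(rendered: Set[Pixel], detected: Set[Pixel]) -> float:
    """Intersection over union of two silhouettes given as pixel sets."""
    union = rendered | detected
    if not union:
        return 0.0  # both masks empty: treat as no overlap
    return len(rendered & detected) / len(union)


def best_frame(rendered_masks: Sequence[Set[Pixel]],
               detected_masks: Sequence[Set[Pixel]]) -> int:
    """Index of the frame whose re-projection overlaps the detection most."""
    scores = [mask_iou(r, d) for r, d in zip(rendered_masks, detected_masks)]
    return max(range(len(scores)), key=scores.__getitem__)


# Three frames, two shape coefficients each (made-up numbers).
shape = median_shape([[0.1, -1.0], [0.3, 2.0], [0.2, 0.5]])
```

The overlap score still depends on the quality of the detected silhouette, so the concern about segmentation errors raised above applies to `best_frame` as well.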

It seems that the current shape estimates are not particularly accurate. In my experiments, the fit could not reproduce the shape of people who were very heavy or very thin.