zhengqili / Neural-Scene-Flow-Fields

PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes"

NSFF Quality on Custom Dataset

breuckelen opened this issue

Hi all. I'm trying to run this method on custom data, with mixed success so far. I was wondering if you had any insight about what might be happening. I'm attaching some videos and images to facilitate discussion.

  1. First of all, we're able to run static NeRF using poses from COLMAP, and it seems to do fine
01_nerf_result.mp4
  2. Likewise, setting the dynamic blending weight to 0 in your model and using only the color reconstruction loss produces plausible results (novel view synthesis result below, for a fixed time)

02_nsff_static_only

  3. Using the dynamic model while setting all the frame indices to 0 should also emulate a static NeRF. It does alright, but includes some strange haze

03_nsff_frame_zero

  4. Finally, running NSFF on our full video sequence with all losses for 130k iterations produces a lot of ghosting (04_nsff_result.mp4).
04_nsff_result.mp4

Even though the data-driven monocular depth / flow losses are phased out during training, I wonder if the monocular depth is perhaps causing these issues? Although, again, both the monocular depth and the flow look reasonable.

05_depth
06_flow

Let me know if you have any insights about what might be going on, and how we can improve quality here -- I'm a bit stumped at the moment. I'm also happy to send you the short dataset / video sequence that we're using if you'd like to take a look.

All the best,
~Ben

Hi,

Could you share the input video sequence?

I recently found that for the dynamic model, if the camera ego-motion is small (i.e., the camera baseline between consecutive video frames is small), our local representation sometimes has difficulty reconstructing the scene well due to our neighborhood local temporal warping (it usually works better if the camera is moving fast, as in the examples we show).

Thanks for the quick response, and for your insight about when the method struggles. I've uploaded the video sequence to google drive here: https://drive.google.com/file/d/1C7XHilFxdpc9pcgfMieo7BuH0Y3Y1qeR/view?usp=sharing

@zhengqili Is there any theoretical reason why this happens, or is this purely an observation? In my opinion the camera baseline shouldn't matter as long as COLMAP still estimates the poses correctly. Or do you mean that a small baseline causes a larger error for COLMAP (relative to the baseline)?

My thought is that a small camera baseline could lead to a degenerate solution for geometry reconstruction. In classical SfM, SLAM, or MVS, we have to choose input frames that contain enough motion parallax (either using keyframe selection or subsampling the input frames from the original video) before performing triangulation; otherwise, the triangulation solver can be ill-conditioned.

NeRF is very beautiful for a static scene because it has no such problem, thanks to its unified 3D global representation. But for our 4D approach, you can think of it as doing local triangulation in a small time window, so if the camera baseline is small, or the background is better modeled by a homography than by an essential matrix (which is actually the case in this video), the reconstruction can get stuck in a degenerate minimum.
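
(To see the ill-conditioning concretely, here is the standard back-of-the-envelope relation for two-view triangulation; this is generic stereo geometry, not anything specific to our method. With focal length f, baseline b, disparity d, and disparity/flow error σ_d,

```latex
Z = \frac{f\,b}{d},
\qquad
\sigma_Z \approx \left|\frac{\partial Z}{\partial d}\right|\,\sigma_d
        = \frac{f\,b}{d^{2}}\,\sigma_d
        = \frac{Z^{2}}{f\,b}\,\sigma_d ,
```

so for a fixed matching error the depth uncertainty scales like 1/b and blows up as the baseline shrinks.)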

Got it, but in my opinion

> the background is better modeled by a homography than by an essential matrix

this happens when there is no prior knowledge of the geometry.
However, here we have a monodepth estimate that seems reasonably accurate; is that still not enough? In your opinion, what can we do to solve this issue, or is there nothing we can do? Like the other issue #15, it seems that we have to be very selective about the video for NSFF to work.

Yes, I also have the impression that the weight is decayed too fast, and at later epochs the depth becomes weird. What I have tried is to decay the weight very slowly; I find this helpful in my scenes, but I don't know if it generalizes to other scenes.
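
For concreteness, a minimal sketch of the kind of schedule I mean (the names and default values are illustrative, not the repo's actual variables):

```python
# Sketch of an exponential decay for the data-driven (monocular depth /
# optical flow) loss weight. Illustrative names only, not NSFF's code.
def data_loss_weight(step, w0=0.04, decay_rate=10.0, decay_steps=100_000):
    # The weight shrinks by `decay_rate`x every `decay_steps` iterations;
    # decaying "very slowly" just means a larger decay_steps (or a smaller
    # decay_rate), so the depth/flow priors keep regularizing for longer.
    return w0 * decay_rate ** (-step / decay_steps)

# e.g. total_loss = img_loss + data_loss_weight(step) * (depth_loss + flow_loss)
```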

Hi @breuckelen, I quickly tried this video. Since I have graduated and don't have a lot of GPUs at the moment, I just used 3-frame consistency by disabling chained scene flow during training (but the results should be very similar for view interpolation). I also subsampled the input frames by 1/2, since all the hyperparameters were validated on ~30-frame sequences.

If I use the default view interpolation camera path in my original GitHub code, it does not look so bad (see the first video below), although it still contains some ghosting. However, if I switch to a larger viewpoint change, the ghosting becomes more severe. To investigate this issue, I tried rendering images only from the dynamic model (in the function "render_bullet_time", change "rgb_map_ref" to "rgb_map_ref_dy"), and the ghosting seems to disappear (see the second video).

My feeling is that the blending weights sometimes do not interpolate very well for larger viewpoint changes in this video.

moving-box-full.mp4
moving-box-dynamic.mp4

Thanks! To confirm, is the second video a rendering of the dynamic model only? Also, what is your N_rand (ray batch size) set to here? 1024?

Yes. In the function "render_bullet_time()", you can change "rgb_map_ref" to "rgb_map_ref_dy" to render images from the dynamic model only. My N_rand is still 1024; num_extra_sample is set to 128 due to the limited GPU memory on my own machine :), but it should not make any difference in this case.

This is not relevant to the current implementation, but if you are interested in how to fix the ghosting for the full model, I found some simple modifications that can help reduce it:

(1) Adding an entropy loss on the blending weight to the total loss:
entropy_loss = 1e-3 * torch.mean(- ret['raw_blend_w'] * torch.log(ret['raw_blend_w'] + 1e-8))

This loss encourages the blending weight to be either 0 or 1, which helps reduce the ghosting caused by learned semi-transparent blending weights.
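
As a sketch of where this could plug in (the function wrapper and the `loss` / `ret` names are assumptions; only 'raw_blend_w' and the 1e-3 weight come from the snippet above):

```python
import torch

def blend_entropy_loss(raw_blend_w, weight=1e-3, eps=1e-8):
    # -w * log(w) is ~0 at both w = 0 and w = 1 and largest in between,
    # so this term pushes the blending weight toward either extreme.
    return weight * torch.mean(-raw_blend_w * torch.log(raw_blend_w + eps))

# Usage inside the training step, e.g.:
# loss = loss + blend_entropy_loss(ret['raw_blend_w'])
```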

(2) Conditioning the predicted time-invariant blending weights on the RGBA output from the dynamic (time-dependent) model. This helps the static model have better interpolation ability in unseen regions during rendering. You need to modify the rigid_nerf class similar to the following:
In __init__:
self.w_linear = nn.Sequential(nn.Linear(W + 4, W), nn.ReLU(), nn.Linear(W, 1))
In the forward function:
blend_w = torch.sigmoid(self.w_linear(torch.cat([input_rgba_dy, h], -1)))
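
Pieced together, a minimal sketch of how this could look as a module (the class name, the W default, and the tensor shapes are my assumptions; only w_linear, input_rgba_dy, and h come from the snippet above, since the full rigid_nerf definition isn't shown here):

```python
import torch
import torch.nn as nn

class BlendWeightHead(nn.Module):
    """Sketch of just the blending-weight head, not the full rigid_nerf model."""

    def __init__(self, W=256):
        super().__init__()
        # The blend weight now also sees the dynamic model's RGBA (4 extra channels).
        self.w_linear = nn.Sequential(nn.Linear(W + 4, W), nn.ReLU(), nn.Linear(W, 1))

    def forward(self, h, input_rgba_dy):
        # h:             [..., W] feature vector from the static branch
        # input_rgba_dy: [..., 4] RGBA predicted by the dynamic model at the same sample
        return torch.sigmoid(self.w_linear(torch.cat([input_rgba_dy, h], -1)))
```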

The rendering results shown below (after 150K iterations of training) are much better. I haven't tried these modifications on a lot of videos, but they are worth trying if you see ghosting effects.

moving_box_bt-15.mp4
moving_box_slowmo-bt.mp4

> If I use the default view interpolation camera path in my original GitHub code

Hi, what do you mean by "the default view interpolation camera path", and how can I modify the viewpoint change? Could you please give some instructions on the COLMAP SfM step? I read the project you linked and followed its procedure, but I still didn't get the same camera parameters as the ones you provide, using the kid-running data.
Thanks a lot!

I'm curious -- is the spiral meant to track the input camera motion in the second video? Or is the entire scene moving (and being reproduced in the dynamic network) due to inaccurate input poses from colmap?

The output camera trajectory is a circular camera-pose offset that I create, added to input camera poses interpolated at the current fractional time indices; you can check how it works in the function render_slowmo_bt.
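
Roughly, the idea is something like the sketch below (the pose convention, the names, and the use of simple linear pose interpolation are my assumptions for illustration, not the actual render_slowmo_bt code):

```python
import numpy as np

def bullet_time_path(input_poses, num_frames=60, radius=0.05):
    # input_poses: [N, 3, 4] camera-to-world matrices (assumed convention).
    out = []
    for i in range(num_frames):
        # Fractional time index into the input sequence.
        t = i / (num_frames - 1) * (len(input_poses) - 1)
        lo = int(np.floor(t))
        hi = min(lo + 1, len(input_poses) - 1)
        w = t - lo
        # Interpolate neighboring input poses (a real implementation would
        # interpolate rotations properly, e.g. with slerp).
        pose = (1 - w) * input_poses[lo] + w * input_poses[hi]
        # Add a circular offset to the camera center to create the "spiral".
        theta = 2 * np.pi * i / num_frames
        pose[:3, 3] += radius * np.array([np.cos(theta), -np.sin(theta), 0.0])
        out.append(pose)
    return np.stack(out)
```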

@zhengqili For this scene I observe that sometimes the dynamic part gets explained by the viewing direction, so I'm trying to remove view dependency during training. Have you encountered this problem before? Theoretically, there is no way to distinguish dynamics from view dependency in my opinion; for example, a shadow can be explained by both.

Yes, shadows and other dynamic volumetric effects such as smoke can go either way. So if you don't care about modeling view-dependent effects in dynamic regions (in most cases they are indeed hard to model from a monocular camera), it's a good idea to turn off view-dependent conditioning.
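
A minimal sketch of what "turning off view-dependent conditioning" can look like in a NeRF-style color head (illustrative only, not the repo's actual network definition):

```python
import torch
import torch.nn as nn

class ColorHead(nn.Module):
    def __init__(self, W=256, use_viewdirs=True):
        super().__init__()
        self.use_viewdirs = use_viewdirs
        in_ch = W + 3 if use_viewdirs else W
        self.rgb = nn.Sequential(nn.Linear(in_ch, W // 2), nn.ReLU(), nn.Linear(W // 2, 3))

    def forward(self, h, viewdirs=None):
        # With use_viewdirs=False the color depends only on position features,
        # so shadows/smoke can no longer be "explained away" by the view direction.
        if self.use_viewdirs:
            h = torch.cat([h, viewdirs], -1)
        return torch.sigmoid(self.rgb(h))
```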

Hi, I tried this using my latest implementation, and it works as well as zhengqi's modifications.

box_sp20.mp4

I did not find

> My thought is that a small camera baseline could lead to a degenerate solution for geometry reconstruction. In classical SfM, SLAM, or MVS, we have to choose input frames that contain enough motion parallax (either using keyframe selection or subsampling the input frames from the original video) before performing triangulation; otherwise, the triangulation solver can be ill-conditioned.
>
> NeRF is very beautiful for a static scene because it has no such problem, thanks to its unified 3D global representation. But for our 4D approach, you can think of it as doing local triangulation in a small time window, so if the camera baseline is small, or the background is better modeled by a homography than by an essential matrix (which is actually the case in this video), the reconstruction can get stuck in a degenerate minimum.

to be a problem. I use all 60 frames to reconstruct the poses and to train.

To show a better comparison, I do not use a blending weight and instead let the network learn how to separate static and dynamic objects. This is the background learnt by the network (quite reasonable: the real background plus some parts of the body that only move a little):

box_sp20_bg.mp4
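
For reference, the general recipe for compositing two fields without a separately learned blending weight is to let each field's own density decide its contribution, roughly as in the sketch below (an illustrative formulation, not my exact implementation):

```python
import torch

def composite_static_dynamic(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    # sigma_s, sigma_d: [N_rays, N_samples]    densities of the two fields
    # rgb_s, rgb_d:     [N_rays, N_samples, 3] colors of the two fields
    # deltas:           [N_rays, N_samples]    distances between ray samples
    sigma = sigma_s + sigma_d
    alpha = 1.0 - torch.exp(-sigma * deltas)
    # Standard NeRF transmittance along each ray.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans
    # Each sample's color is the density-weighted mix of the two fields, so
    # whichever field is "solid" at that point dominates automatically.
    rgb = (sigma_s[..., None] * rgb_s + sigma_d[..., None] * rgb_d) / (sigma[..., None] + 1e-10)
    return torch.sum(weights[..., None] * rgb, dim=1)   # [N_rays, 3]
```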

The advantage is that for static regions we know it is going to perform as well as a normal NeRF, so there is no ghosting in those regions, and only the dynamic part might be subject to artifacts. In NSFF, on the other hand, aside from the static network, the final rendering also depends on a blending weight that we cannot control. Like @zhengqili said, maybe

> My feeling is that the blending weights sometimes do not interpolate very well for larger viewpoint changes in this video.

Time interpolation also looks good (there are artifacts around the face and the body: it actually moves but looks static). I'm wondering if I can use some prior to encourage the network to learn the whole body as dynamic...

box_fv20.mp4

@kwea123 Hello, can you share a related tutorial for using this project with our custom video data?