nsff_pl

Neural Scene Flow Fields using pytorch-lightning. This repo reimplements the NSFF idea, but modifies several operations based on observation of NSFF results and discussions with the authors. For discussion details, please see the issues of the original repo. The code is based on my previous implementation.

The main modifications are the following:

Remove the blending weight in static NeRF. I adopt the addition strategy in NeRF-W.
Compose the static and dynamic parts in image warping as well.

Implementation details are in models/rendering.py.
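To illustrate the addition strategy, here is a minimal NumPy sketch of NeRF-W-style compositing along a single ray. It is not the code in models/rendering.py; the function name, argument names, and shapes are assumptions made for illustration only.

```python
import numpy as np

def composite_static_dynamic(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    """Composite one ray from static and dynamic samples (hypothetical helper).

    sigma_s, sigma_d: (N,) densities of the static and dynamic fields.
    rgb_s, rgb_d:     (N, 3) colors of the static and dynamic fields.
    deltas:           (N,) distances between adjacent samples.
    """
    alpha_s = 1.0 - np.exp(-sigma_s * deltas)
    alpha_d = 1.0 - np.exp(-sigma_d * deltas)
    # Densities are added, so no separate blending weight is needed.
    alpha_total = 1.0 - np.exp(-(sigma_s + sigma_d) * deltas)
    # Transmittance T_i = prod_{j<i} (1 - alpha_total_j), with T_0 = 1.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha_total)[:-1]])
    w_s = trans * alpha_s  # per-sample static weights
    w_d = trans * alpha_d  # per-sample dynamic weights
    rgb = (w_s[:, None] * rgb_s).sum(axis=0) + (w_d[:, None] * rgb_d).sum(axis=0)
    return rgb, w_s, w_d
```

Setting sigma_d to zero reduces this to ordinary NeRF volume rendering of the static field, which is one way to obtain a background-only rendering.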

These modifications empirically produce better results on the kid-running scene, as shown below:

IMPORTANT: The code for the kid-running scene has moved to the nsff_orig branch (the images are still shown here as a showcase)! The master branch will be updated for custom data usage.

Full reconstruction

^{Left: GT. Center: this repo (PSNR=35.02). Right: pretrained model of the original repo (PSNR=30.45).}

Background reconstruction

^{Left: this repo. Right: pretrained model of the original repo (by setting raw_blend_w to 0).}

Fix-view-change-time (view 8, times from 0 to 16)

^{Left: this repo. Right: pretrained model of the original repo.}

Fix-time-change-view (time 8, views from 0 to 16)

^{Left: this repo. Right: pretrained model of the original repo.}

Novel view synthesis (spiral)

The color of our method is more vivid and closer to the GT images both qualitatively and quantitatively (not because of gif compression). Also, the background is more stable and cleaner.

Bonus - Depth

Our method also produces smoother depths, although this might not have a direct impact on image quality.

^{Top left: static depth from this repo. Top right: full depth from this repo.
Bottom left: static depth from the original repo. Bottom right: full depth from the original repo.}

⚠️ However, more experiments on other scenes are needed to finally prove that these modifications produce overall better quality.

💻 Installation

Hardware

OS: Ubuntu 18.04
NVIDIA GPU with CUDA>=10.2 (tested with 1 RTX2080Ti)

Software

Clone this repo by git clone --recursive https://github.com/kwea123/nsff_pl
Python>=3.6 (installation via anaconda is recommended, use conda create -n nsff_pl python=3.6 to create a conda environment and activate it by conda activate nsff_pl)
Python libraries
- Install core requirements by pip install -r requirements.txt

🔑 Training

0. Data preparation

~~The data preparation follows the original repo. Therefore, please follow here to prepare the data (resized images, monodepth and flow) for training.~~ If your data follows the format of the original repo, or if you use the kid-running sequence, please use the nsff_orig branch.

Otherwise, create a root directory (e.g. foobar), create a folder named images and prepare your images (it is recommended to have at least 30 images) under it, so the structure looks like:

└── foobar
    └── images
        ├── 00000.png
        ...
        └── 00029.png

Save the root directory as an environment variable to simplify the code in the following processes:

export ROOT_DIR=/path/to/foobar/

The image names can be arbitrary, but the lexical order should be the same as time order! E.g. you can name the images as a.png, c.png, dd.png but the time order must be a -> c -> dd.
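The following sketch shows what lexical order means in practice. It assumes the file names are sorted with plain string comparison, as Python's built-in sorted does; time_ordered is a hypothetical helper, not a function from this repo.

```python
def time_ordered(names):
    """Return image names in the lexical order that is treated as time order."""
    return sorted(names)

# Arbitrary names are fine as long as they sort in time order.
print(time_ordered(["dd.png", "a.png", "c.png"]))
# -> ['a.png', 'c.png', 'dd.png']

# Unpadded numbers sort lexically, not numerically.
print(time_ordered(["10.png", "2.png", "1.png"]))
# -> ['1.png', '10.png', '2.png']

# Zero padding restores the intended order.
print(time_ordered([f"{i:05d}.png" for i in (10, 2, 1)]))
# -> ['00001.png', '00002.png', '00010.png']
```

Zero-padded frame indices, as in the directory tree above, are the simplest way to keep both orders identical.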

1. Motion mask prediction and COLMAP pose reconstruction

Motion mask prediction

In order to correctly reconstruct the camera poses, we must first filter out the dynamic areas so that feature points in these areas are not matched during estimation.

I use maskrcnn from detectron2. Only semantic masks are used, as I find flow-based masks too noisy.

Install detectron2 by python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.8/index.html.

Modify the DYNAMIC_CATEGORIES variable in third_party/predict_mask.py to the dynamic classes in your data (only COCO classes are supported). Run python third_party/predict_mask.py --root_dir $ROOT_DIR. After that, your root directory will contain motion masks (0=dynamic and 1=static):

└── foobar
    ├── images
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    └── masks
        ├── 00000.png.png
        ...
        └── 00029.png.png

The masks need not be perfect: they may cover some static regions, but most of the dynamic regions MUST lie inside the mask. If they do not, try lowering DETECTION_THR or changing DYNAMIC_CATEGORIES.
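As a rough illustration of how a category list and a detection threshold shape the final mask, the sketch below merges per-instance masks into a single motion mask. It is not the code of third_party/predict_mask.py; build_motion_mask and its arguments are hypothetical, and the default threshold of 0.5 is only a placeholder.

```python
import numpy as np

def build_motion_mask(instance_masks, class_names, scores,
                      dynamic_categories, detection_thr=0.5):
    """Merge instance masks into one motion mask (0 = dynamic, 1 = static).

    instance_masks: (K, H, W) boolean array, one mask per detection.
    class_names:    length-K list of predicted class names.
    scores:         length-K list of detection confidences.
    """
    height, width = instance_masks.shape[1:]
    mask = np.ones((height, width), dtype=np.uint8)  # start fully static
    for inst, name, score in zip(instance_masks, class_names, scores):
        # Keep only confident detections of classes declared dynamic.
        if name in dynamic_categories and score >= detection_thr:
            mask[inst] = 0
    return mask
```

Lowering detection_thr or adding categories can only grow the dynamic area, which is why both are the suggested remedies when moving objects fall outside the mask.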

COLMAP pose reconstruction

Please first install COLMAP following the official tutorial.

Here I only briefly explain how to reconstruct the poses using the GUI. For command-line usage, please refer to the COLMAP documentation.

Run colmap gui.
Select the tab Reconstruction -> Automatic reconstruction.
Select "Workspace folder" as foobar, "Image folder" as foobar/images, "Mask folder" as foobar/masks.
Select "Data type" as "Video frames".
Check "Shared intrinsics" and uncheck "Dense model".
Press "Run".

After reconstruction, you should see the reconstructed camera poses as red quadrangular pyramids, along with a reconstructed point cloud. Please roughly judge whether the poses are correct (e.g. if your camera moves forward but COLMAP reconstructs horizontal movement, the result is incorrect). If they are not, consider retaking the photos.

Now your root directory should look like:

└── foobar
    ├── images
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    ├── masks
    │   ├── 00000.png.png
    │   ...
    │   └── 00029.png.png
    ├── database.db
    └── sparse
        └── 0
            ├── cameras.bin
            ├── images.bin
            ├── points3D.bin
            └── project.ini

2. Monodepth and optical flow prediction

Monodepth

The instructions and code are borrowed from BoostingMonocularDepth.

Download the mergenet model weights from here and put them in third_party/depth/pix2pix/checkpoints/mergemodel/.
Download the model weights from MiDas-v2 and put them in third_party/depth/midas/.
From third_party/depth, run python run.py --Final --data_dir $ROOT_DIR/images --output_dir $ROOT_DIR/disps --depthNet 0

It will create 16-bit depth images under $ROOT_DIR/disps. This monodepth method is more accurate than most SOTA methods, at the cost of speed: it takes a few seconds to process each image.

RAFT

The instructions and code are borrowed from RAFT.

Download raft-things.pth from google drive and put it in third_party/flow/models/.
From third_party/flow/, run python demo.py --model models/raft-things.pth --path $ROOT_DIR.
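To inspect the predicted flow, the sketch below reads and writes a flow field, assuming the .flo files follow the Middlebury layout (a float32 magic number 202021.25, then int32 width and height, then interleaved float32 u, v values). Whether demo.py in this repo writes exactly this layout is an assumption; read_flo and write_flo are hypothetical helpers.

```python
import os
import tempfile

import numpy as np

FLO_MAGIC = 202021.25  # sanity-check value at the start of Middlebury .flo files

def write_flo(path, flow):
    """Write an (H, W, 2) flow field in Middlebury .flo layout."""
    height, width = flow.shape[:2]
    with open(path, "wb") as f:
        np.array([FLO_MAGIC], dtype=np.float32).tofile(f)
        np.array([width, height], dtype=np.int32).tofile(f)
        flow.astype(np.float32).tofile(f)

def read_flo(path):
    """Read a Middlebury .flo file into an (H, W, 2) float32 array."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, dtype=np.float32, count=1)[0]
        if magic != np.float32(FLO_MAGIC):
            raise ValueError(f"{path} does not start with the .flo magic number")
        width, height = (int(v) for v in np.fromfile(f, dtype=np.int32, count=2))
        data = np.fromfile(f, dtype=np.float32, count=2 * width * height)
    return data.reshape(height, width, 2)

# Round trip on a small random flow field.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "00000.flo")
    original = np.random.rand(4, 6, 2).astype(np.float32)
    write_flo(path, original)
    restored = read_flo(path)
```

The last axis holds the horizontal and vertical displacement in pixels, so restored[..., 0] is u and restored[..., 1] is v.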

Finally, your root directory will have all of this:

└── foobar
    ├── images
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    ├── masks
    │   ├── 00000.png.png
    │   ...
    │   └── 00029.png.png
    ├── database.db
    ├── sparse
    │   └── 0
    │       ├── cameras.bin
    │       ├── images.bin
    │       ├── points3D.bin
    │       └── project.ini
    ├── disps
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    ├── flow_fw
    │   ├── 00000.flo
    │   ...
    │   └── 00028.flo
    └── flow_bw
        ├── 00001.flo
        ...
        └── 00029.flo
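Before training, it can help to verify that the directory matches the tree above. The following sketch checks file counts and the COLMAP outputs; check_layout is a hypothetical helper, and the rule that masks and disparities have one file per image is inferred from the tree rather than from the training code.

```python
from pathlib import Path

def count_files(directory, pattern="*"):
    """Count files matching a pattern; a missing directory counts as empty."""
    return len(list(directory.glob(pattern))) if directory.is_dir() else 0

def check_layout(root_dir):
    """Return a list of problems found in the prepared data directory."""
    root = Path(root_dir)
    problems = []
    n_images = count_files(root / "images")
    if n_images == 0:
        problems.append("images/ is missing or empty")
    # One mask and one disparity map per image.
    for sub in ("masks", "disps"):
        n = count_files(root / sub)
        if n != n_images:
            problems.append(f"{sub}/ has {n} files, expected {n_images}")
    # Flow links consecutive frames, so there is one file fewer than images.
    expected_flows = max(n_images - 1, 0)
    for sub in ("flow_fw", "flow_bw"):
        n = count_files(root / sub, "*.flo")
        if n != expected_flows:
            problems.append(f"{sub}/ has {n} .flo files, expected {expected_flows}")
    # Sparse COLMAP model.
    for name in ("cameras.bin", "images.bin", "points3D.bin"):
        if not (root / "sparse" / "0" / name).exists():
            problems.append(f"sparse/0/{name} is missing")
    return problems
```

An empty list means the layout looks complete; for example, print(check_layout("/path/to/foobar")) lists anything that is missing.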

Now you can start training!

3. Train!

Run the following command (modify the parameters according to opt.py):

python train.py \
  --dataset_name monocular --root_dir $ROOT_DIR \
  --img_wh 512 288 --start_end 0 30 --batch_from_same_image \
  --N_samples 128 --N_importance 0 --encode_t \
  --num_epochs 50 --batch_size 512 \
  --optimizer adam --lr 5e-4 --lr_scheduler cosine \
  --exp_name exp
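To get a feel for what --lr 5e-4 --lr_scheduler cosine --num_epochs 50 implies, here is a sketch of the standard cosine annealing formula. It assumes one scheduler step per epoch and a minimum learning rate of 0; the actual scheduler configuration in this repo may differ, and cosine_lr is a hypothetical helper.

```python
import math

def cosine_lr(epoch, num_epochs=50, base_lr=5e-4, eta_min=0.0):
    """Cosine-annealed learning rate after `epoch` scheduler steps."""
    cosine = math.cos(math.pi * epoch / num_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)

for epoch in (0, 25, 50):
    print(epoch, cosine_lr(epoch))
```

Under these assumptions the learning rate starts at 5e-4, passes through about 2.5e-4 at the midpoint, and decays to approximately 0 by the last epoch.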

Comparison with other repos

Repo	Training GPU memory in GB (batch size 512)	Speed (1 step)	Training time / final PSNR on kid-running
Original	7.6	0.2s	96 GPUh / 30.45
This repo	5.9	0.2s	12 GPUh / 35.02

The speed is measured on 1 RTX2080Ti.
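For reference, PSNR can be computed as sketched below. This assumes images are float arrays with values in [0, 1], which is the usual NeRF convention; the exact evaluation code behind the numbers in the table is not shown here, and psnr is a hypothetical helper.

```python
import numpy as np

def psnr(pred, gt):
    """Peak signal-to-noise ratio in dB for images with values in [0, 1]."""
    mse = np.mean((pred - gt) ** 2)
    return -10.0 * np.log10(mse)

gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
print(psnr(pred, gt))  # MSE is 0.01, so this is 20 dB
```

Because the scale is logarithmic, a gap of 4.57 dB on a single image corresponds to a mean squared error that is lower by a factor of about 2.9.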

🔎 Testing

See test.ipynb for scene reconstruction, scene decomposition, fix-time-change-view, and more. You can get almost everything out of this notebook. I will add more instructions to it in the future.

Use eval.py to create the whole sequence of moving views. E.g.

python eval.py \
  --dataset_name monocular --root_dir $ROOT_DIR \
  --N_samples 128 --N_importance 0 --img_wh 512 288 --start_end 0 30 \
  --encode_t --output_transient \
  --split test --video_format gif --fps 5 \
  --ckpt_path kid.ckpt --scene_name kid_reconstruction

⚠️ Other differences with the original paper

I add an entropy loss as suggested here. This allows the person to be "thin" and produces fewer artifacts when the camera is far from the original pose.
I explicitly zero the flows of far regions to avoid the flow being trapped in local minima (reason explained here).
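One possible form of such an entropy term is sketched below. It is an assumption for illustration, not necessarily the loss used in this repo: it penalizes -w log w on the per-sample weights of the dynamic field along a ray, which is small when each weight is close to 0 or 1 and larger when the weights are spread over many samples.

```python
import numpy as np

def entropy_loss(dynamic_weights, eps=1e-8):
    """Entropy-style penalty on per-sample dynamic weights (hypothetical form).

    dynamic_weights: (..., N) rendering weights of the dynamic field per ray.
    """
    per_sample = -dynamic_weights * np.log(dynamic_weights + eps)
    return per_sample.sum(axis=-1).mean()

thin = np.array([[0.0, 0.0, 1.0, 0.0]])        # weight concentrated on one sample
spread = np.array([[0.25, 0.25, 0.25, 0.25]])  # weight smeared along the ray
print(entropy_loss(thin), entropy_loss(spread))
```

The concentrated ray scores close to zero while the spread-out ray scores about 1.39, so minimizing this term favors a thin dynamic surface.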

TODO

Add COLMAP reconstruction tutorial (mask out dynamic region).
Remove the NSFF dependency for data preparation. More precisely, the original code needs quite a lot of modifications to work on custom data, and the depth and flow are computed on resized images, which might reduce their accuracy.
Add spiral path for testing.
Add mask hard mining at the beginning of training.
Exploit motion mask prior like https://free-view-video.github.io/
