nsff_pl
Neural Scene Flow Fields using pytorch-lightning. This repo reimplements the NSFF idea, but modifies several operations based on observation of NSFF results and discussions with the authors. For discussion details, please see the issues of the original repo. The code is based on my previous implementation.
The main modifications are as follows:
- Remove the blending weight in static NeRF. I adopt the addition strategy in NeRF-W.
- Compose static and dynamic components in image warping as well.
Implementation details are in models/rendering.py.
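To illustrate the addition strategy, here is a minimal NumPy sketch of NeRF-W style compositing, in which static and dynamic densities are summed for the transmittance and each component contributes its own alpha-weighted color. It is not the code in models/rendering.py; the function name `composite_render` and the argument layout are assumptions made for this example.

```python
import numpy as np


def composite_render(static_sigma, static_rgb, dynamic_sigma, dynamic_rgb, deltas):
    """Composite static and dynamic samples along rays (illustrative sketch).

    sigmas and deltas have shape (n_rays, n_samples);
    rgbs have shape (n_rays, n_samples, 3).
    """
    # Per-component opacity of each sample.
    static_alpha = 1.0 - np.exp(-deltas * static_sigma)
    dynamic_alpha = 1.0 - np.exp(-deltas * dynamic_sigma)
    # Densities add, so the combined opacity uses the summed sigma.
    alpha = 1.0 - np.exp(-deltas * (static_sigma + dynamic_sigma))
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=-1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=-1)
    static_w = trans * static_alpha
    dynamic_w = trans * dynamic_alpha
    rgb = (static_w[..., None] * static_rgb + dynamic_w[..., None] * dynamic_rgb).sum(axis=1)
    return rgb, static_w, dynamic_w
```

With this formulation no separate blending weight is needed: setting the dynamic density to zero reduces the result to ordinary static NeRF rendering.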
These modifications empirically produce better results on the `kid-running` scene, as shown below:
IMPORTANT: The code for the `kid-running` scene has been moved to the `nsff_orig` branch (the images are still shown here as a showcase)! The `master` branch will be updated for custom data usage.
Full reconstruction
Left: GT. Center: this repo (PSNR=35.02). Right: pretrained model of the original repo (PSNR=30.45).
Background reconstruction
Left: this repo. Right: pretrained model of the original repo (by setting raw_blend_w to 0).
Fix-view-change-time (view 8, times from 0 to 16)
Left: this repo. Right: pretrained model of the original repo.
Fix-time-change-view (time 8, views from 0 to 16)
Left: this repo. Right: pretrained model of the original repo.
Novel view synthesis (spiral)
The color of our method is more vivid and closer to the GT images both qualitatively and quantitatively (not because of gif compression). Also, the background is more stable and cleaner.
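For reference, PSNR between a rendered image and the GT can be computed as below. This is a generic sketch assuming images are float arrays with values in [0, 1]; the exact evaluation protocol behind the numbers reported here (resolution, which frames are averaged) is not specified in this README.

```python
import numpy as np


def psnr(pred, gt):
    """Peak signal-to-noise ratio in dB for images with values in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mse = np.mean((pred - gt) ** 2)
    # Identical images have zero error, i.e. infinite PSNR.
    return float("inf") if mse == 0 else float(-10.0 * np.log10(mse))
```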
Bonus - Depth
Our method also produces smoother depths, although this might not have a direct impact on image quality.
Top left: static depth from this repo. Top right: full depth from this repo.
Bottom left: static depth from the original repo. Bottom right: full depth from the original repo.
Installation
Hardware
- OS: Ubuntu 18.04
- NVIDIA GPU with CUDA>=10.2 (tested with 1 RTX2080Ti)
Software
- Clone this repo by `git clone --recursive https://github.com/kwea123/nsff_pl`
- Python>=3.6 (installation via anaconda is recommended; use `conda create -n nsff_pl python=3.6` to create a conda environment and activate it by `conda activate nsff_pl`)
- Python libraries
  - Install core requirements by `pip install -r requirements.txt`
Training
0. Data preparation
The data preparation follows the original repo. Therefore, please follow here to prepare the data (resized images, monodepth and flow) for training. If your data follows the original repo's format, or if you use the `kid-running` sequence, please use the `nsff_orig` branch.
Otherwise, create a root directory (e.g. `foobar`), create a folder named `images` inside it, and put your images there (at least 30 images are recommended), so that the structure looks like:
└── foobar
    └── images
        ├── 00000.png
        ...
        └── 00029.png
Save the root directory as an environment variable to simplify the code in the following processes:
export ROOT_DIR=/path/to/foobar/
The image names can be arbitrary, but their lexical order must match the time order! E.g. you can name the images `a.png`, `c.png`, `dd.png`, but the time order must be a -> c -> dd.
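To check the ordering before running the rest of the pipeline, a small script like the following can list the frames in lexical order. It is a sketch: `list_frames` is a hypothetical helper, not part of this repo, and it assumes that plain string sorting matches how the frames are ordered later on.

```python
from pathlib import Path


def list_frames(root_dir, min_frames=30):
    """Return the image file names under <root_dir>/images in lexical order."""
    image_dir = Path(root_dir) / "images"
    frames = sorted(p.name for p in image_dir.iterdir() if p.is_file())
    if len(frames) < min_frames:
        print(f"Warning: only {len(frames)} images found; "
              f"at least {min_frames} are recommended.")
    return frames
```

Print the returned list and confirm that it matches the capture order. Note that unpadded numbers sort unexpectedly as strings (`10.png` comes before `2.png`), so zero-padded names such as `00002.png` are safer.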
1. Motion mask prediction and COLMAP pose reconstruction
Motion mask prediction
In order to correctly reconstruct the camera poses, we must first filter out the dynamic areas so that feature points in these areas are not matched during estimation.
I use Mask R-CNN from detectron2. Only semantic masks are used, as I find flow-based masks too noisy.
Install detectron2 by `python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.8/index.html`.
Modify the `DYNAMIC_CATEGORIES` variable in `third_party/predict_mask.py` to the dynamic classes in your data (only COCO classes are supported). Run `python third_party/predict_mask.py --root_dir $ROOT_DIR`. After that, your root directory will contain motion masks (0=dynamic and 1=static):
└── foobar
    ├── images
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    └── masks
        ├── 00000.png.png
        ...
        └── 00029.png.png
The masks need not be perfect: they may cover some static regions, but most of the dynamic regions MUST lie inside the mask. If not, try lowering `DETECTION_THR` or changing `DYNAMIC_CATEGORIES`.
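The way the category list and the detection threshold shape the mask can be sketched as follows. This is not the code in third_party/predict_mask.py; `build_motion_mask` and its arguments are hypothetical and assume per-instance boolean masks, class names and confidence scores have already been obtained from the detector.

```python
import numpy as np


def build_motion_mask(image_hw, instance_masks, class_names, scores,
                      dynamic_categories, detection_thr=0.5):
    """Return a uint8 mask of shape (H, W): 0 = dynamic, 1 = static."""
    motion_mask = np.ones(image_hw, dtype=np.uint8)
    for mask, name, score in zip(instance_masks, class_names, scores):
        # Only confident detections of a dynamic class are masked out.
        if name in dynamic_categories and score >= detection_thr:
            motion_mask[np.asarray(mask, dtype=bool)] = 0
    return motion_mask
```

Lowering the threshold lets less confident detections into the mask, and adding a category masks every detected instance of that class, which is why both knobs help when dynamic regions are being missed.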
COLMAP pose reconstruction
Please first install COLMAP following the official tutorial.
Here I only briefly explain how to reconstruct the poses using the GUI. For command line usage, please refer to the COLMAP documentation.
- Run `colmap gui`.
- Select the tab `Reconstruction -> Automatic reconstruction`.
- Select "Workspace folder" as `foobar`, "Image folder" as `foobar/images`, "Mask folder" as `foobar/masks`.
- Select "Data type" as "Video frames".
- Check "Shared intrinsics" and uncheck "Dense model".
- Press "Run".
After reconstruction, you should see the reconstructed camera poses as red quadrangular pyramids, along with a reconstructed point cloud. Please roughly check whether the poses are correct (e.g. if your camera moves forward but COLMAP reconstructs horizontal movement, the result is incorrect). If they are not, consider retaking the photos.
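For a scripted run, the GUI settings above roughly correspond to COLMAP's `automatic_reconstructor` command. The sketch below only builds and launches that command from Python; the option names reflect my understanding of the COLMAP CLI and should be verified with `colmap automatic_reconstructor -h` for your version. `colmap_command` and `run_colmap` are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def colmap_command(root_dir):
    """Build the command line mirroring the GUI settings (assumed flag names)."""
    root = Path(root_dir)
    return [
        "colmap", "automatic_reconstructor",
        "--workspace_path", str(root),
        "--image_path", str(root / "images"),
        "--mask_path", str(root / "masks"),
        "--data_type", "video",      # "Video frames" in the GUI
        "--single_camera", "1",      # "Shared intrinsics" checked
        "--dense", "0",              # "Dense model" unchecked
    ]


def run_colmap(root_dir):
    """Run the reconstruction; raises CalledProcessError if COLMAP fails."""
    subprocess.run(colmap_command(root_dir), check=True)
```

It is still worth opening the result in the GUI afterwards to judge the poses visually.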
Now your root directory should look like:
└── foobar
    ├── images
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    ├── masks
    │   ├── 00000.png.png
    │   ...
    │   └── 00029.png.png
    ├── database.db
    └── sparse
        └── 0
            ├── cameras.bin
            ├── images.bin
            ├── points3D.bin
            └── project.ini
2. Monodepth and optical flow prediction
Monodepth
The instructions and code are borrowed from BoostingMonocularDepth.
- Download the mergenet model weights from here and put them in `third_party/depth/pix2pix/checkpoints/mergemodel/`.
- Download the model weights from MiDas-v2 and put them in `third_party/depth/midas/`.
- From `third_party/depth`, run `python run.py --Final --data_dir $ROOT_DIR/images --output_dir $ROOT_DIR/disps --depthNet 0`

It will create 16bit depth images under `$ROOT_DIR/disps`. This monodepth method is more accurate than most SOTA methods, at the cost of taking a few seconds to process each image.
RAFT
The instructions and code are borrowed from RAFT.
- Download `raft-things.pth` from google drive and put it in `third_party/flow/models/`.
- From `third_party/flow/`, run `python demo.py --model models/raft-things.pth --path $ROOT_DIR`.
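To inspect the predicted flows, the files can be read back with NumPy. The sketch below assumes they are stored in the Middlebury `.flo` layout (a float32 magic number, int32 width and height, then interleaved float32 u/v values, all little-endian); I have not verified this against the output of this repo's `demo.py`. `read_flo` and `write_flo` are hypothetical helpers.

```python
import numpy as np

FLO_MAGIC = 202021.25  # sanity-check value at the start of Middlebury .flo files


def write_flo(path, flow):
    """Write a (H, W, 2) flow array in Middlebury .flo layout."""
    flow = np.asarray(flow, dtype="<f4")
    height, width = flow.shape[:2]
    with open(path, "wb") as f:
        np.array([FLO_MAGIC], dtype="<f4").tofile(f)
        np.array([width, height], dtype="<i4").tofile(f)
        flow.tofile(f)


def read_flo(path):
    """Read a Middlebury .flo file into a (H, W, 2) float32 array."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, dtype="<f4", count=1)
        if magic.size != 1 or magic[0] != np.float32(FLO_MAGIC):
            raise ValueError(f"{path} is not a Middlebury .flo file")
        width, height = (int(v) for v in np.fromfile(f, dtype="<i4", count=2))
        data = np.fromfile(f, dtype="<f4", count=2 * width * height)
    return data.reshape(height, width, 2)
```

A quick check is that the flow's height and width match the corresponding image and that the values are finite.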
Finally, your root directory will have all of this:
└── foobar
    ├── images
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    ├── masks
    │   ├── 00000.png.png
    │   ...
    │   └── 00029.png.png
    ├── database.db
    ├── sparse
    │   └── 0
    │       ├── cameras.bin
    │       ├── images.bin
    │       ├── points3D.bin
    │       └── project.ini
    ├── disps
    │   ├── 00000.png
    │   ...
    │   └── 00029.png
    ├── flow_fw
    │   ├── 00000.flo
    │   ...
    │   └── 00028.flo
    └── flow_bw
        ├── 00001.flo
        ...
        └── 00029.flo
Now you can start training!
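Before launching a long training run, it can help to verify that the directory matches the layout above: N images, N masks, N disparity maps, and N-1 forward and backward flows. The following is a sketch using only the standard library; `check_layout` is a hypothetical helper, not part of this repo.

```python
from pathlib import Path


def count_files(directory):
    """Number of regular files in a directory (0 if it does not exist)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.iterdir() if p.is_file())


def check_layout(root_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(root_dir)
    n_frames = count_files(root / "images")
    problems = []
    if n_frames == 0:
        problems.append("images/ is missing or empty")
    # Flows connect consecutive frames, so there is one fewer than images.
    n_flows = max(n_frames - 1, 0)
    expected = {"masks": n_frames, "disps": n_frames,
                "flow_fw": n_flows, "flow_bw": n_flows}
    for subdir, n_expected in expected.items():
        n_found = count_files(root / subdir)
        if n_found != n_expected:
            problems.append(f"{subdir}/: expected {n_expected} files, found {n_found}")
    for name in ("cameras.bin", "images.bin", "points3D.bin"):
        if not (root / "sparse" / "0" / name).is_file():
            problems.append(f"sparse/0/{name} is missing")
    return problems
```

For example, `print(check_layout("/path/to/foobar"))` lists anything that is missing; it checks only file counts, not file contents.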
3. Train!
Run the following command (modify the parameters according to `opt.py`):
python train.py \
--dataset_name monocular --root_dir $ROOT_DIR \
--img_wh 512 288 --start_end 0 30 --batch_from_same_image \
--N_samples 128 --N_importance 0 --encode_t \
--num_epochs 50 --batch_size 512 \
--optimizer adam --lr 5e-4 --lr_scheduler cosine \
--exp_name exp
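To get a feel for what `--lr 5e-4 --lr_scheduler cosine --num_epochs 50` implies, the sketch below evaluates a standard cosine annealing schedule per epoch. It assumes the option maps to plain cosine annealing (as in PyTorch's `CosineAnnealingLR`) with a minimum learning rate of 0; check `opt.py` and the training code for the actual behavior.

```python
import math


def cosine_lr(epoch, num_epochs=50, base_lr=5e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (assumed schedule)."""
    progress = epoch / num_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 5e-4, falls to half of that at epoch 25, and approaches 0 at epoch 50, so changing `--num_epochs` also changes how quickly the learning rate decays.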
Comparison with other repos
Method | Training GPU memory in GB (batch size 512) | Speed (1 step) | Training time / final PSNR on kid-running |
---|---|---|---|
Original | 7.6 | 0.2s | 96 GPUh / 30.45 |
This repo | 5.9 | 0.2s | 12 GPUh / 35.02 |
The speed is measured on 1 RTX2080Ti.
Testing
See test.ipynb for scene reconstruction, scene decomposition, fix-time-change-view, etc. You can get almost everything out of this notebook. I will add more instructions to it in the future.
Use eval.py to create the whole sequence of moving views. E.g.
python eval.py \
--dataset_name monocular --root_dir $ROOT_DIR \
--N_samples 128 --N_importance 0 --img_wh 512 288 --start_end 0 30 \
--encode_t --output_transient \
--split test --video_format gif --fps 5 \
--ckpt_path kid.ckpt --scene_name kid_reconstruction
Other differences from the original paper
- I add entropy loss as suggested here. This allows the person to be "thin" and produces fewer artifacts when the camera is far from the original pose.
- I explicitly zero the flows of far regions to avoid the flow being trapped in local minima (reason explained here).
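As a rough illustration of the first point, an entropy penalty on per-ray sample weights favors weights concentrated on a few samples over weights spread along the ray. The sketch below shows only this general idea; which weights the repo actually penalizes and with what coefficient are not stated here, so the function `entropy_penalty` and its inputs are assumptions.

```python
import numpy as np


def entropy_penalty(weights, eps=1e-8):
    """Mean over rays of sum_i -w_i * log(w_i); weights has shape (n_rays, n_samples)."""
    weights = np.asarray(weights, dtype=np.float64)
    # eps avoids log(0) for samples with zero weight.
    per_ray = -(weights * np.log(weights + eps)).sum(axis=-1)
    return float(per_ray.mean())
```

A ray whose weight sits on a single sample has a penalty near 0, while a ray with weight spread evenly over many samples has a large one, which matches the "thin" behavior described above.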
TODO
- Add COLMAP reconstruction tutorial (mask out dynamic region).
- Remove NSFF dependency for data preparation. More precisely, the original code needs quite a lot of modifications to work on custom data, and the depth/flow are calculated on resized images, which might reduce their accuracy.
- Add spiral path for testing.
- Add mask hard mining at the beginning of training.
- Exploit motion mask prior like https://free-view-video.github.io/