NVIDIA / vid2vid

PyTorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic video-to-video translation.

spatio-temporal growing

fa9r opened this issue

Hey, in your paper you mention in the experiment section: "Implementation details. We train our network in a spatio-temporally progressive manner. In particular, we start with generating low-resolution videos with few frames, and all the way up to generating full resolution videos with 30 (or more) frames."

How exactly did you do the scaling? I looked through your code but couldn't find anything related to it. In particular, I would like to know whether you increase the spatial and temporal sizes at the same time or one after the other, and whether you adjusted other hparams along the way.

What I mean by adjusting hparams: this progressive growing is mainly used to reduce training time, I guess. So if the model would usually take N epochs to train on high-res videos from scratch, you probably trained for M < N epochs per growing stage, right? Then what is M? N divided by the number of stages? And did you use a larger batch size at smaller resolutions, or make any other notable hparam changes?

I think the authors use the base_model.build_pyr function to generate videos at different resolutions, controlled by the hparams --loadSize and --n_scales_spatial. In the generator you can see the loop for s in range(n_scales):, which implements the "spatio-temporally progressive manner."
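
To make that concrete, here's a rough sketch of what a pix2pixHD-style coarse-to-fine loop over scales looks like. Note that build_pyr here is my own stand-in, and the channel handling is simplified, so don't read this as the repo's actual code:

```python
import torch
import torch.nn.functional as F

def build_pyr(img, n_scales):
    """Hypothetical stand-in for base_model.build_pyr: returns the same
    image at n_scales resolutions, coarsest first."""
    pyramid = [img]
    for _ in range(n_scales - 1):
        pyramid.insert(0, F.avg_pool2d(pyramid[0], kernel_size=2))
    return pyramid

def coarse_to_fine_forward(netGs, input_img, n_scales):
    """Run one generator per scale; each finer scale is conditioned on the
    upsampled output of the coarser one (pix2pixHD-style). netGs is assumed
    to be a list of n_scales modules with matching input channel counts."""
    pyramid = build_pyr(input_img, n_scales)
    fake = None
    for s in range(n_scales):  # the loop referred to above, coarse -> fine
        cond = pyramid[s]
        if fake is not None:   # feed the coarser result up to the next scale
            cond = torch.cat([cond, F.interpolate(fake, scale_factor=2)], dim=1)
        fake = netGs[s](cond)
    return fake
```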

I think these parameters just build the coarse-to-fine generator of pix2pixHD. Is that really what's meant by "spatio-temporally progressive"? Then what is the temporal aspect? The temporal scales of D?

The temporal aspect can be explained by the iteration over fake_B_prevs.
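
In other words, the generator runs autoregressively over time, conditioning each frame on the previously generated ones. A minimal sketch of that recurrence (assuming, for simplicity, that inputs and outputs have the same channel count, and that netG takes the concatenated frames; neither is exactly how the repo structures it):

```python
import torch

def generate_sequence(netG, real_As, tG=3):
    """Sketch of the temporal recurrence: each output frame is conditioned
    on the current input plus the tG-1 previously generated frames
    (the fake_B_prevs). real_As: (T, C, H, W) input label maps."""
    fake_Bs, fake_B_prevs = [], []
    for t in range(real_As.size(0)):
        # pad with zeros until tG-1 previous frames are available
        prevs = fake_B_prevs[-(tG - 1):]
        while len(prevs) < tG - 1:
            prevs = [torch.zeros_like(real_As[t])] + prevs
        netG_input = torch.cat([real_As[t]] + prevs, dim=0).unsqueeze(0)
        fake_B = netG(netG_input).squeeze(0)
        fake_Bs.append(fake_B)
        # detach to truncate backprop through time (the real training keeps
        # gradients for a window of frames, this sketch keeps none)
        fake_B_prevs.append(fake_B.detach())
    return fake_Bs
```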

Hmmm, okay I guess. Seems like I understood it wrongly then. Thanks for the help.

Update: the README says this for Cityscapes training:

  • We adopt a coarse-to-fine approach, sequentially increasing the resolution from 512 x 256, 1024 x 512, to 2048 x 1024.
  • Train a model at 512 x 256 resolution (bash ./scripts/street/train_512.sh)
  • Train a model at 1024 x 512 resolution (must train 512 x 256 first) (bash ./scripts/street/train_1024.sh)

This looks a lot more like how I understood "spatio-temporally progressive", i.e. you train 512x256 -> 1024x512 -> 2048x1024.
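
To illustrate what "train 512 first, then initialize 1024 from it" could look like, here's a rough sketch of the checkpoint handoff; init_from_coarse_checkpoint and the partial state-dict matching are my own illustration, not the repo's actual loading code:

```python
import torch

def init_from_coarse_checkpoint(netG_fine, coarse_ckpt_path):
    """Sketch of the spatial-growing handoff: before training at 1024x512,
    copy whatever parameters the finer generator shares with the 512x256
    checkpoint, leaving the newly added layers at their random init.
    Assumes the checkpoint was saved as a plain state dict."""
    coarse_state = torch.load(coarse_ckpt_path, map_location="cpu")
    own_state = netG_fine.state_dict()
    # copy only tensors that exist in both models with matching shapes
    matched = {k: v for k, v in coarse_state.items()
               if k in own_state and v.shape == own_state[k].shape}
    own_state.update(matched)
    netG_fine.load_state_dict(own_state)
    print(f"initialized {len(matched)}/{len(own_state)} tensors from coarse model")
```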

EDIT: I finally found the temporal growing: it is controlled by --niter_step, which doubles the training sequence length every few epochs.
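
Here's a small sketch of the schedule I think --niter_step implies; the starting length and the cap (6 and 30 frames) are just illustrative numbers, not the repo's defaults:

```python
def training_seq_length(epoch, niter_step, n_frames_start=6, n_frames_max=30):
    """Temporal-growing schedule: start with short clips and double the
    training sequence length every niter_step epochs, up to n_frames_max."""
    return min(n_frames_start * 2 ** (epoch // niter_step), n_frames_max)

# e.g. with niter_step=5: epochs 0-4 train on 6 frames,
# epochs 5-9 on 12, epochs 10-14 on 24, then capped at 30
```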

TL;DR: spatial growing is implemented by manually increasing --loadSize and initializing from the previous (smaller) checkpoint. Temporal growing is implemented by doubling the training sequence length every --niter_step epochs.