haochen-rye / HNeRV

Official Pytorch implementation for HNeRV: a hybrid video neural representation (CVPR 2023)

Home Page: https://haochen-rye.github.io/HNeRV/


The difference between HNeRV and AutoEncoder

dawnlh opened this issue · comments

commented

Hi~ @haochen-rye
Thanks for sharing your nice work. After reading the paper, I find that the network structure and design of HNeRV seem similar to an auto-encoder (AE). Although original AEs are mainly used for supervised/unsupervised learning, applying them to data fitting/compression is also a direct and valid idea. For classical NeRF (or NeRV, from your other work), one can use a coordinate to query the corresponding pixel or frame values. But for HNeRV, the input is actually the video/frame itself rather than a coordinate, which means one cannot query the desired data from an explicit coordinate; instead, one must have the frame embedding from the encoder beforehand to query the image.

I think this is the main difference between HNeRV and conventional NeRF, NeRV, and E-NeRV. Have I misunderstood something? And what is your opinion on the difference?

BTW, I wonder how long it takes to train NeRV & HNeRV. I didn't find the absolute time in the paper. Thanks.

commented

HNeRV vs NeRV (or NeRF etc.):

  • hybrid vs. implicit representation; content-adaptive input embedding vs. content-agnostic embedding (coordinate)

HNeRV vs auto-encoder:

  • Embed size: tiny vs. huge. HNeRV uses a tiny frame embedding that stores only a little video content, while an AE stores all information in a huge embedding.
  • Generalization: HNeRV is fit on only one video, while an AE is generally fit on a large dataset and should be able to reconstruct all of its videos. Therefore an AE is much bigger and slower than HNeRV in most cases.
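The embedding-size contrast above can be made concrete with a toy calculation. The numbers below are illustrative assumptions, not the exact configurations from the paper: a HNeRV-style embedding with 16 channels on a 64×-downsampled grid, versus a typical AE latent with 128 channels at 16× downsampling.

```python
# Toy comparison of per-frame embedding footprint.
# All shapes are illustrative assumptions, not the paper's actual configs.

def embed_size(c, h, w):
    """Number of scalar values in a C x H x W embedding."""
    return c * h * w

# A 640 x 1280 RGB frame holds this many values.
frame_vals = 3 * 640 * 1280                              # 2,457,600

# HNeRV-style tiny embedding: 16 channels on a 64x-downsampled grid.
hnerv_vals = embed_size(16, 640 // 64, 1280 // 64)       # 16 x 10 x 20 = 3,200

# Typical AE latent: 128 channels at 16x downsampling.
ae_vals = embed_size(128, 640 // 16, 1280 // 16)         # 128 x 40 x 80 = 409,600

print(f"frame:       {frame_vals} values")
print(f"HNeRV embed: {hnerv_vals} values ({hnerv_vals / frame_vals:.2%} of frame)")
print(f"AE latent:   {ae_vals} values ({ae_vals / frame_vals:.2%} of frame)")
```

Under these assumed shapes, the HNeRV embedding keeps roughly 0.1% of the frame's raw values, so almost all content must live in the decoder weights, whereas the AE latent keeps two orders of magnitude more.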
commented

Got it. Then what about the training time? The paper gives the relative time consumption result, but I wonder how long it takes to finish the data fitting.

commented

We conduct all experiments in PyTorch with RTX 2080 Ti GPUs, where it takes around 8 s per epoch to train on a 130-frame video of size 640 × 1280.
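For a rough absolute total, here is my back-of-envelope arithmetic. The 300-epoch schedule is an assumption (a common NeRV-style default); the thread only states the per-epoch time.

```python
# Back-of-envelope total training time.
# Assumption: a 300-epoch schedule; only the 8 s/epoch figure is from the thread.
sec_per_epoch = 8
epochs = 300

total_sec = sec_per_epoch * epochs
print(f"{total_sec} s = {total_sec / 60:.0f} min")  # 2400 s = 40 min
```

So under that assumption, fitting one such video takes on the order of 40 minutes on a single RTX 2080 Ti.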

commented

OK, thanks for your prompt reply.