jasonyzhang / RayDiffusion

Code for "Cameras as Rays"


Question about Inference Time and Generalization

DiamondGlassDrill opened this issue · comments

Thank you immensely for your exceptional work! Your innovative and much-needed approach is truly remarkable.

Before proceeding with testing the model, I have a couple of inquiries:

  1. Regarding inference time, could you provide an estimate of how long it takes to determine the diffused camera positions with 4, 8, and 16 images (at 256×256 and 512×512 resolution)? Specifically, I'm interested in performance on widely used GPUs like the 3090 or 4090.

  2. I'm curious about the model's generalization capabilities, especially with objects it hasn't encountered during training. For instance, if the model has been trained predominantly on real estate images, it might excel with cubic geometries. But how does it perform with datasets containing multiple household objects? To what extent can it generalize to objects/geometries it has never seen before?

I greatly appreciate your insights in advance.

  1. Our model always expects image crops of 224x224 (because of our DINO backbone), so all images will be resized to that size. I just benchmarked 8 images on a 3090, and inference took 11 seconds using the diffusion model. If you have more than 8 images, the runtime should grow roughly linearly.
  2. Since we train on CO3D, I would expect the model to generalize to most household objects as long as the photos are taken in a roughly object-facing fashion. You can look at Figure 6 for some examples of what those types of images might look like. I would not expect the model to work on real estate images (i.e., outdoor scenes or images where the cameras point away from each other).
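The two answers above can be sketched numerically: images are square-cropped and resized to the 224×224 DINO input regardless of original resolution, and runtime can be extrapolated linearly from the 8-image / 11-second 3090 benchmark. The functions below are an illustrative sketch under those assumptions, not the repo's actual preprocessing or timing code.

```python
# Sketch of the preprocessing geometry and a rough runtime estimate.
# The 224x224 crop size and the "8 images in ~11 s on a 3090" figure
# come from the answer above; everything else is a hypothetical helper.

CROP_SIZE = 224  # input size expected by the DINO backbone


def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square,
    which would then be resized down to CROP_SIZE x CROP_SIZE."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def estimate_runtime_seconds(n_images: int,
                             base_images: int = 8,
                             base_seconds: float = 11.0) -> float:
    """Linear extrapolation from the reported 3090 benchmark."""
    return base_seconds * n_images / base_images


if __name__ == "__main__":
    # A 512x512 input is already square, so the crop covers the whole image;
    # a 640x480 input loses 80 px on each side before the resize.
    print(center_crop_box(512, 512))     # (0, 0, 512, 512)
    print(center_crop_box(640, 480))     # (80, 0, 560, 480)
    print(estimate_runtime_seconds(16))  # ~22 s for 16 images
```

Since the backbone always sees 224×224 inputs, the original resolution (256 vs. 512) should affect only the cheap resize step, not the diffusion runtime.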