sled-group / chat-with-nerf

Chat with NeRF enables users to interact with a NeRF model by typing in natural language.

Home Page: https://chat-with-nerf.github.io

Always return the same photos

kwea123 opened this issue

commented

No matter what I ask, it always returns photos from the same small set of maybe 5~6 images, not necessarily the same single photo every time.
For example, this one for the "office" scene: I intentionally gave a prompt that this image cannot answer, "show me the ceiling", but it still shows the same photo.
image

Hi @kwea123 thanks for your feedback!

This is a small optimization we did just for the demo page: we pre-rendered 6 images for each scene (a camera standing in the middle of the room, rotated 6 times) to speed up the image rendering step, so that demo users won't wait too long while the agent is reasoning (rendering pictures in real time using nerfstudio + LERF takes about 30 seconds).
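
For context, the pre-rendering is conceptually just a fixed set of camera poses rotated around the room center. The sketch below shows one way such poses could be generated; the room center, camera height, and the render_rgb call are placeholder assumptions, not the actual chat-with-nerf code.

```python
# Illustrative sketch only: generate 6 pre-render camera poses by placing a
# camera at the room center and rotating it in 60-degree yaw steps.
import numpy as np

def yaw_pose(center: np.ndarray, yaw_rad: float) -> np.ndarray:
    """Camera-to-world pose at `center`, looking horizontally at angle `yaw_rad`."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    # Rotation about the vertical (y) axis.
    pose[:3, :3] = np.array([[c, 0.0, s],
                             [0.0, 1.0, 0.0],
                             [-s, 0.0, c]])
    pose[:3, 3] = center
    return pose

room_center = np.array([0.0, 1.5, 0.0])  # assumed camera position, ~1.5 m above the floor
poses = [yaw_pose(room_center, np.deg2rad(60 * i)) for i in range(6)]

# pre_rendered = [render_rgb(pose, resolution=(512, 512)) for pose in poses]  # hypothetical render call
```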

We are aware this is a big pain point for this pipeline, so the next item on our TODO list is a "smarter" rendering step. Specifically:

  • We will pre-calculate the 3D semantic embeddings (using something like the CLIP vision encoder) of all "key points" in the scene
  • When a user text query comes in, we calculate a relevancy metric of the query against all key-point embeddings and determine the "hot areas" of the room where we should render pictures for downstream reasoning (a rough sketch of this step follows below)
  • Only pass the photos of these "hot areas" to the downstream multimodal reasoning pipeline
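
A minimal sketch of that relevancy step, assuming the key-point positions and their CLIP embeddings were pre-computed offline. The function name hot_areas, the tensor names, and the plain cosine-similarity score are illustrative choices, not the repo's implementation.

```python
import torch
import torch.nn.functional as F
import open_clip

# Text encoder used to embed the user query (any CLIP variant would do here).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def hot_areas(query: str,
              keypoint_xyz: torch.Tensor,   # (N, 3) key-point positions in the scene
              keypoint_emb: torch.Tensor,   # (N, D) pre-computed CLIP embeddings
              top_k: int = 2) -> torch.Tensor:
    """Return positions of the key points most relevant to the text query."""
    with torch.no_grad():
        text_emb = model.encode_text(tokenizer([query]))  # (1, D)
    text_emb = F.normalize(text_emb, dim=-1)
    kp_emb = F.normalize(keypoint_emb, dim=-1)
    relevancy = (kp_emb @ text_emb.T).squeeze(-1)         # cosine similarity, shape (N,)
    best = relevancy.topk(top_k).indices
    return keypoint_xyz[best]                             # place render cameras around these points

# e.g. hot_areas("show me the ceiling", keypoint_xyz, keypoint_emb) -> (top_k, 3) positions
```

Note that LERF's relevancy score also normalizes the query against canonical negative phrases ("object", "things", "stuff", "texture"); plain cosine similarity is used above only to keep the sketch short.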

Hopefully this speeds up the process enough that the demo can use images rendered in real time instead of the 6 pre-rendered photos, and the photos will also have better camera poses because they are conditioned on the text query rather than on a fixed point in the room.

We are also open to better ideas and implementations for this rendering step. Would love to hear what you think!

commented

NeRF rendering is slow, and GPT-4 matching and searching are also slow, so I think real-time feedback is nearly impossible here.

@mu-cai I believe it should be doable with the pipeline proposed above. Based on our experience, it takes 3-5 seconds to render a 512 × 512 picture in NeRF. So if we calculate the relevancy scores and only take pictures around the relevant areas (only 1-2 pics per text query), the experience should be near real-time. We can also add streaming to the rendering process (i.e., display a picture as soon as it is done instead of waiting for all images to finish), which can further reduce the perceived latency on the user's side.
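
A rough illustration of the streaming idea, with render_view and chat_ui.show standing in as hypothetical names for the actual per-pose NeRF render call and the UI hook:

```python
# Illustrative only: yield each rendered view as soon as it finishes so the UI
# can show partial results instead of waiting for the whole batch.
from typing import Callable, Iterable, Iterator

def stream_renders(camera_poses: Iterable, render_view: Callable) -> Iterator:
    """Yield one rendered image per pose, in order of completion."""
    for pose in camera_poses:
        yield render_view(pose)  # ~3-5 s per 512x512 view, per the estimate above

# for image in stream_renders(poses, render_view):
#     chat_ui.show(image)  # hypothetical UI hook: display each picture immediately
```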

Hi @kwea123 @mu-cai, we have implemented the above features in #17. You can try out these new features with our latest demo to see them in action!

Some quick screenshots:

"How many doors are there in this room?"
image

"find all the chairs"
image