sled-group / chat-with-nerf

Chat with NeRF enables users to interact with a NeRF model by typing in natural language.

Home Page: https://chat-with-nerf.github.io

Always return the same photos

kwea123 opened this issue

commented

No matter what I ask, it always returns photos from the same small set of maybe 5~6 images, not necessarily the same single photo every time.
For example, this one for the "office" scene: I intentionally gave a prompt that this image cannot answer, "show me the ceiling", but it still shows the same photo.
image

Hi @kwea123 thanks for your feedback!

This is a small optimization we did just for the demo page: we pre-rendered 6 images for each scene (a camera standing in the middle of the room, rotated 6 times) to speed up the image rendering step, so that demo users won't wait too long while the agent is reasoning (rendering pictures in real time using nerfstudio + LERF takes about 30 seconds).
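
For context, the pre-rendering is conceptually just a fixed set of camera poses rotated around the room center. The sketch below shows one way such poses could be generated; the room center, camera height, and the render_rgb call are placeholder assumptions, not the actual chat-with-nerf code.

```python
# Illustrative sketch only: generate 6 pre-render camera poses by placing a
# camera at the room center and rotating it in 60-degree yaw steps.
import numpy as np

def yaw_pose(center: np.ndarray, yaw_rad: float) -> np.ndarray:
    """Camera-to-world pose at `center`, looking horizontally at angle `yaw_rad`."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    # Rotation about the vertical (y) axis.
    pose[:3, :3] = np.array([[c, 0.0, s],
                             [0.0, 1.0, 0.0],
                             [-s, 0.0, c]])
    pose[:3, 3] = center
    return pose

room_center = np.array([0.0, 1.5, 0.0])  # assumed camera position, ~1.5 m above the floor
poses = [yaw_pose(room_center, np.deg2rad(60 * i)) for i in range(6)]

# pre_rendered = [render_rgb(pose, resolution=(512, 512)) for pose in poses]  # hypothetical render call
```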

We are aware this is a big pain point for this pipeline, so the next item on our TODO list is a "smarter" rendering step. Specifically:

  • We will pre-calculate the 3D semantic embeddings (using something like the CLIP vision encoder) of all "key points" in the scene
  • When a user text query comes in, we calculate a relevancy metric of the query against all key-point embeddings and determine the "hot areas" of the room where we should render pictures for downstream reasoning (a rough sketch of this step follows below)
  • Only pass the photos of these "hot areas" to the downstream multimodal reasoning pipeline
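
A minimal sketch of that relevancy step, assuming the key-point positions and their CLIP embeddings were pre-computed offline. The function name hot_areas, the tensor names, and the plain cosine-similarity score are illustrative choices, not the repo's implementation.

```python
import torch
import torch.nn.functional as F
import open_clip

# Text encoder used to embed the user query (any CLIP variant would do here).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def hot_areas(query: str,
              keypoint_xyz: torch.Tensor,   # (N, 3) key-point positions in the scene
              keypoint_emb: torch.Tensor,   # (N, D) pre-computed CLIP embeddings
              top_k: int = 2) -> torch.Tensor:
    """Return positions of the key points most relevant to the text query."""
    with torch.no_grad():
        text_emb = model.encode_text(tokenizer([query]))  # (1, D)
    text_emb = F.normalize(text_emb, dim=-1)
    kp_emb = F.normalize(keypoint_emb, dim=-1)
    relevancy = (kp_emb @ text_emb.T).squeeze(-1)         # cosine similarity, shape (N,)
    best = relevancy.topk(top_k).indices
    return keypoint_xyz[best]                             # place render cameras around these points

# e.g. hot_areas("show me the ceiling", keypoint_xyz, keypoint_emb) -> (top_k, 3) positions
```

Note that LERF's relevancy score also normalizes the query against canonical negative phrases ("object", "things", "stuff", "texture"); plain cosine similarity is used above only to keep the sketch short.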

Hopefully this speeds up the process enough that the demo can use images rendered in real time instead of the 6 pre-rendered photos, and the photos will also have better camera poses because they are conditioned on the text query rather than on a fixed point in the room.

We are also open to better ideas and implementations for this rendering step. Would love to hear what you think!

commented

NeRF rendering is slow, and GPT-4 matching and searching are also slow, so I think real-time feedback is nearly impossible here.

@mu-cai I believe it should be doable with the pipeline proposed above. Based on our experience, it takes 3-5 seconds to render a 512 × 512 picture in NeRF. So if we calculate the relevancy scores and only take pictures around the relevant areas (only 1-2 pics per text query), the experience should be near real-time. We can also add streaming to the rendering process (i.e., display a picture as soon as it is done instead of waiting for all images to finish), which can further reduce the perceived latency on the user's side.
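
A rough illustration of the streaming idea, with render_view and chat_ui.show standing in as hypothetical names for the actual per-pose NeRF render call and the UI hook:

```python
# Illustrative only: yield each rendered view as soon as it finishes so the UI
# can show partial results instead of waiting for the whole batch.
from typing import Callable, Iterable, Iterator

def stream_renders(camera_poses: Iterable, render_view: Callable) -> Iterator:
    """Yield one rendered image per pose, in order of completion."""
    for pose in camera_poses:
        yield render_view(pose)  # ~3-5 s per 512x512 view, per the estimate above

# for image in stream_renders(poses, render_view):
#     chat_ui.show(image)  # hypothetical UI hook: display each picture immediately
```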

Hi @kwea123 @mu-cai, we have implemented the above features in #17. You can try out these new features with our latest demo to see them in action!

Some quick screenshots:

"How many doors are there in this room?"
image

"find all the chairs"
image