UMass-Foundation-Model / 3D-LLM

Code for 3D-LLM: Injecting the 3D World into Large Language Models


The processing of Objaverse features

SwimZhang opened this issue

Hi

When I run blip_oa.py in '3DLanguage_data/ChatCaptioner_based/gen_features/', I get this error:
RuntimeError: Input type (unsigned char) and bias type (c10::Half) should be the same.
I tried to fix it by changing line 167 to 'output = visual_encoder(image.float())'. Is that OK?
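
For context, here is a minimal sketch of that kind of dtype fix (hypothetical names; not necessarily the change the maintainers made). The error means the image tensor is still uint8 while the encoder weights are fp16, so casting the input to the encoder's parameter dtype avoids the mismatch:

import torch

# Hypothetical sketch, not the repository's exact fix: cast the uint8 image
# tensor to whatever dtype the visual encoder's parameters use (fp16 here).
# Plain image.float() can still mismatch fp16 weights unless autocast is active.
def encode_views(visual_encoder, image):
    param_dtype = next(visual_encoder.parameters()).dtype  # e.g. torch.float16
    with torch.no_grad():
        return visual_encoder(image.to(param_dtype))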

One more thing: it seems the number of rendered images should be 8, not 4.

When I run blip_oa.py in '3DLanguage_data/ChatCaptioner_based/gen_features/', I get this error:
RuntimeError: Input type (unsigned char) and bias type (c10::Half) should be the same.
I tried to fix it by changing line 167 to 'output = visual_encoder(image.float())'. Is that OK?

For the first question, we have already updated the script for generating features. See #43.

One more thing: it seems the number of rendered images should be 8, not 4.

In our final version, we use 4 images to generate the caption with ChatCaptioner; this produced the best results. However, you can choose any number of viewpoints when generating features by modifying theta_view in these lines:

theta_view = [
[-1 / 4, -1 / 4],
[1 / 4, 1 / 4],
[3 / 4, 3 / 4],
[-3 / 4, -3 / 4],
]
phi_view = [
[1 / 12, 1 / 12],
[1 / 12, 1 / 12],
[1 / 12, 1 / 12],
[1 / 12, 1 / 12],
]
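
For example, assuming the theta values are evenly spaced azimuth fractions as in the 4-view lists above (an assumption, not something stated in the thread), an 8-view setup could look like this sketch, keeping the same [start, end] pair format:

num_views = 8
# Sketch: 8 evenly spaced theta entries; phi is simply repeated so both lists
# stay the same length, which is what sets the number of rendered viewpoints.
theta_view = [[-1 + 2 * i / num_views] * 2 for i in range(num_views)]
phi_view = [[1 / 12, 1 / 12] for _ in range(num_views)]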

Fixed in beff99e.

[annotated screenshot]

As shown in the annotated part, the program cannot be executed if the number of images is < 25 or 8.

Did you use 4 images or 8 images when rendering the open-source object data you used previously?

When generating the 3D features, we used 8 images.