UMass-Foundation-Model / 3D-LLM

Code for 3D-LLM: Injecting the 3D World into Large Language Models


The processing of Objaverse features

SwimZhang opened this issue

Hi

When I run blip_oa.py in '3DLanguage_data/ChatCaptioner_based/gen_features/', I get this error:
RuntimeError: Input type (unsigned char) and bias type (c10::Half) should be the same.
I tried to fix it by changing line 167 to 'output = visual_encoder(image.float())'. Is that OK?
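
For context, here is a minimal sketch of that kind of dtype fix (hypothetical names; not necessarily the change the maintainers made). The error means the image tensor is still uint8 while the encoder weights are fp16, so casting the input to the encoder's parameter dtype avoids the mismatch:

import torch

# Hypothetical sketch, not the repository's exact fix: cast the uint8 image
# tensor to whatever dtype the visual encoder's parameters use (fp16 here).
# Plain image.float() can still mismatch fp16 weights unless autocast is active.
def encode_views(visual_encoder, image):
    param_dtype = next(visual_encoder.parameters()).dtype  # e.g. torch.float16
    with torch.no_grad():
        return visual_encoder(image.to(param_dtype))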

One more thing: it seems the number of rendered images should be 8, not 4.

When I run blip_oa.py in '3DLanguage_data/ChatCaptioner_based/gen_features/', I get this error:
RuntimeError: Input type (unsigned char) and bias type (c10::Half) should be the same.
I tried to fix it by changing line 167 to 'output = visual_encoder(image.float())'. Is that OK?

For the first question, we have already updated the script for generating features. See #43.

One more thing: it seems the number of rendered images should be 8, not 4.

In our final version, we use 4 images to generate the caption with ChatCaptioner; this produced the best results. However, you can choose any number of viewpoints when generating features by modifying theta_view in these lines:

theta_view = [
[-1 / 4, -1 / 4],
[1 / 4, 1 / 4],
[3 / 4, 3 / 4],
[-3 / 4, -3 / 4],
]
phi_view = [
[1 / 12, 1 / 12],
[1 / 12, 1 / 12],
[1 / 12, 1 / 12],
[1 / 12, 1 / 12],
]
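
For example, assuming the theta values are evenly spaced azimuth fractions as in the 4-view lists above (an assumption, not something stated in the thread), an 8-view setup could look like this sketch, keeping the same [start, end] pair format:

num_views = 8
# Sketch: 8 evenly spaced theta entries; phi is simply repeated so both lists
# stay the same length, which is what sets the number of rendered viewpoints.
theta_view = [[-1 + 2 * i / num_views] * 2 for i in range(num_views)]
phi_view = [[1 / 12, 1 / 12] for _ in range(num_views)]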

Fixed in beff99e.

[annotated screenshot]

As shown in the annotated part, the program cannot be executed if the number of images is < 25 or 8.

Did you use 4 images or 8 images when rendering the open-source object data you used previously?

When generating the 3D features, we used 8 images.