IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything

Home Page: https://arxiv.org/abs/2401.14159

Weird inference time for grounding_dino with vit_h and vit_tiny

stupidyoh opened this issue

commented

Hello! Thank you for your great work.

Recently, I tested several of the provided demo scripts, such as "grounded_light_hqsam" and "grounded_sam_simple_demo", and I got some odd results with the following code.

(First part)
detections = grounding_dino_model.predict_with_classes(
    image=image,                  # image as loaded with cv2.imread (BGR)
    classes=CLASSES,              # class names used as the text prompt
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD  # note: the box threshold is reused here
)

(Second part)
detections.mask = segment(
    sam_predictor=sam_predictor,
    image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB),  # SAM expects RGB input
    xyxy=detections.xyxy          # prompt SAM with the detected boxes
)

For grounded_light_hqsam, which uses "vit_h" as the SAM encoder, the first part takes 1.574 seconds and the second part takes 0.611 seconds. For grounded_sam_simple_demo, which uses "vit_tiny", the first part takes 2.177 seconds and the second part takes 0.136 seconds.

The shorter time for the second part makes sense to me, since vit_tiny is the lighter model. But I have no idea why the first part takes more time in the vit_tiny setup: that part only runs Grounding DINO, so it should not depend on which SAM encoder is loaded.

I want to use these models in real time, so I need the inference time to be shorter. I would appreciate any advice on why this result came out and how to shorten the time.
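
For reference, here is a minimal sketch of how I measure these numbers. It assumes the same variables as in the demo script; torch.cuda.synchronize() is called before reading the clock because CUDA kernels run asynchronously, so without it the timer mostly captures Python-side launch overhead rather than actual GPU work.

import time
import torch

torch.cuda.synchronize()
t0 = time.perf_counter()
detections = grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
)
torch.cuda.synchronize()  # wait for all queued GPU kernels to finish
t1 = time.perf_counter()
detections.mask = segment(
    sam_predictor=sam_predictor,
    image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB),
    xyxy=detections.xyxy
)
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f"first part: {t1 - t0:.3f} s, second part: {t2 - t1:.3f} s")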

Thank you!

commented

I'm sorry. The time is different for every single test, but the deviation is larger than I thought.
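
To get a more stable number, I am now discarding a few warm-up iterations and averaging over repeated runs, roughly as sketched below. The first calls after loading a model include one-off costs such as CUDA context creation and cuDNN autotuning, so they are much slower than steady state; the bench helper and its parameters here are just placeholders I made up for illustration.

import statistics
import time
import torch

def bench(fn, warmup=3, runs=20):
    # Discard the first few calls: they include one-off setup costs
    # and are not representative of steady-state latency.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)

mean_s, std_s = bench(lambda: grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
))
print(f"first part: {mean_s:.3f} s +/- {std_s:.3f} s")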