NExT-ChatV / NExT-Chat

Is there some kind of template to explain to the model that a mask is needed?
Also, the model almost never returns multiple bboxes for the same class when asked something like: "Detect XXX. Please include several object locations". Most often it returns a single bbox that covers all matching objects, but I would like a separate bbox for each one.

Thanks for your question! The "Detect XXX" prompt was mainly trained on single-object detection scenarios. One way to get multiple similar objects is to use "Can you describe the image and detect objects?". However, that prompt is meant for whole-image description and is weak at referring to specific objects. To get separate boxes for a given class, I think further fine-tuning on that kind of data is required. We are sorry that the model cannot do this yet.

Thanks for the reply. Is there a prompt that will tell the model that I need a mask?

If there is a box, there will be a corresponding mask. However, I filter out some low-quality masks by applying a threshold, iou_thres, to the predicted IoU. You can set iou_thres to 0 (the default is 0.3) here:

NExT-Chat/mllm/demo/demo_util.py

Line 267 in ea67b83

           temperature=0.75, top_p=0.7, top_k=5, boxes=None, boxes_seq=None, iou_thres=0.3):
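To illustrate why setting iou_thres to 0 keeps every mask, here is a minimal sketch of threshold-based filtering. It is not the code from demo_util.py: the function name filter_masks_by_predicted_iou, its arguments, and the use of >= for the comparison are all assumptions made for illustration, and the strings stand in for real mask arrays.

```python
from typing import List, Sequence, Tuple


def filter_masks_by_predicted_iou(
    masks: Sequence[object],
    predicted_ious: Sequence[float],
    iou_thres: float = 0.3,
) -> List[Tuple[int, object]]:
    """Return (index, mask) pairs whose predicted IoU is at least iou_thres.

    Hypothetical helper: the real demo code may filter differently.
    """
    if len(masks) != len(predicted_ious):
        raise ValueError("masks and predicted_ious must have the same length")
    return [
        (i, mask)
        for i, (mask, score) in enumerate(zip(masks, predicted_ious))
        if score >= iou_thres
    ]


# Placeholder masks and made-up scores, one score per mask.
masks = ["mask_a", "mask_b", "mask_c"]
scores = [0.82, 0.12, 0.30]

default_kept = filter_masks_by_predicted_iou(masks, scores)
all_kept = filter_masks_by_predicted_iou(masks, scores, iou_thres=0.0)

print(default_kept)  # [(0, 'mask_a'), (2, 'mask_c')]
print(len(all_kept))  # 3
```

With the default threshold the low-scoring mask is dropped, while a threshold of 0 returns one mask per box, assuming predicted IoU scores are never negative.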

Which prompt should I use to generate the masks?