OpenGVLab / all-seeing

[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World

Home Page: https://huggingface.co/spaces/OpenGVLab/all-seeing


Special tokens

KooSung opened this issue

Nice work! Why didn't All-Seeing v2 add <ref> etc. to the special tokens?

Thank you for your interest in our project.

Early experimental results indicate that adding special tokens such as <ref>, <box>, and <rel> has only a minor impact on performance. Therefore, to keep things simple, we decided not to add any special tokens.
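
For context, here is a minimal sketch of the two options being compared, assuming a Hugging Face tokenizer (the checkpoint name is only illustrative, not necessarily ASMv2's actual base model):

```python
from transformers import AutoTokenizer

# Illustrative base LLM; ASMv2's actual base checkpoint may differ.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

# Option A (not used by ASMv2): register the tags as dedicated special tokens,
# each mapped to a single new token id. The LLM's embedding table would then
# need to be resized and the new rows trained from scratch:
#   tokenizer.add_special_tokens({"additional_special_tokens":
#       ["<ref>", "</ref>", "<box>", "</box>", "<rel>", "</rel>"]})
#   model.resize_token_embeddings(len(tokenizer))

# Option B (what ASMv2 does): keep the tags as plain text, so the existing
# vocabulary splits them into ordinary subword tokens and no new embeddings
# have to be learned.
print(tokenizer.tokenize("<ref>a dog</ref><box>[[120, 45, 600, 380]]</box>"))
```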

@Weiyun1025 Thanks. Another question: when training a regular detection model, the bboxes must be adjusted to match the image preprocessing, so why is it enough to simply normalize the bboxes to 0-1000 (or apply square_pad) during LLM training? Qwen-VL also does this, but the reason is not explained.

Adjusting the bboxes is necessary when data augmentation is used. However, we do not use any data augmentation except image flipping, for which we preprocess the bboxes offline.
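
As an illustration of that offline flip preprocessing (a sketch only; the horizontal direction, the function name, and the (x1, y1, x2, y2) pixel convention are assumptions, not the project's actual code):

```python
def hflip_bbox(bbox, image_width):
    """Mirror a pixel-space bbox (x1, y1, x2, y2) for a horizontally flipped image.

    Because the flip is the only augmentation, this can be applied once,
    offline, to produce a flipped copy of each annotation.
    """
    x1, y1, x2, y2 = bbox
    return (image_width - x2, y1, image_width - x1, y2)
```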

For the second question: since the input size of ASMv2 is only 336x336, a scale of 1000 is large enough (the 0-1000 grid is already finer than the 336-pixel grid, so quantization loses essentially nothing). If the input size were scaled up to, say, 2000x2000, it might be necessary to enlarge the scale.
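
For concreteness, a sketch of the 0-1000 normalization under discussion (the function name and the square-pad convention, padding on the bottom/right, are assumptions for illustration):

```python
def normalize_bbox(bbox, width, height, scale=1000, square_pad=False):
    """Map a pixel-space bbox (x1, y1, x2, y2) to integer coords in [0, scale]."""
    if square_pad:
        # Treat the image as padded to a square (here: on the bottom/right)
        # before resizing, so both axes share the same normalization factor.
        width = height = max(width, height)
    x1, y1, x2, y2 = bbox
    return (round(x1 / width * scale), round(y1 / height * scale),
            round(x2 / width * scale), round(y2 / height * scale))

# e.g. normalize_bbox((120, 45, 600, 380), width=800, height=600)
# -> (150, 75, 750, 633)
```

Because any subsequent uniform resize scales the image and the box together, these normalized coordinates need no further adjustment during training.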