Beckschen / ViTamin

[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"

How much data is used to finetune the model on segmentation task?

goodstudent9 opened this issue · comments

Great work!
This page gives some instructions for preparing data for segmentation:
https://github.com/Beckschen/ViTamin/blob/main/vitamin_fcclip/GETTING_STARTED.md
So your model also needs to be fine-tuned for the segmentation task, right?
I would like to know whether you used all of the segmentation datasets listed in https://github.com/Beckschen/ViTamin/blob/main/vitamin_fcclip/datasets/README.md to fine-tune your model, or only a subset of them, while still showing strong open-vocabulary capability.
Looking forward to your reply!

Thanks for your interest!
(1) For the first question, "So your model also needs to be fine-tuned for the segmentation task":
We transfer our ViTamin models to open-vocabulary segmentation tasks while keeping ViTamin frozen as the CLIP image encoder. Consequently, the ViTamin model remains unchanged during the transfer to the segmentation tasks.
What does "frozen" mean? Please refer to the paper "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP" (https://arxiv.org/pdf/2308.02487.pdf).
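A minimal sketch of this frozen-backbone setup, assuming a PyTorch-style training loop: the pretrained CLIP image encoder receives no gradient updates and only the segmentation decoder is trained. The module names (FrozenCLIPSegmentor, mask_decoder) are hypothetical placeholders, not the actual ViTamin / FC-CLIP API.

```python
# Hypothetical sketch: freeze the CLIP image encoder, train only the decoder.
import torch
import torch.nn as nn

class FrozenCLIPSegmentor(nn.Module):
    def __init__(self, clip_image_encoder: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.backbone = clip_image_encoder
        self.decoder = mask_decoder
        # Freeze every backbone parameter so the CLIP encoder stays unchanged.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.backbone.eval()  # also fixes normalization/dropout behavior

    def train(self, mode: bool = True):
        # Keep the frozen backbone in eval mode even when the wrapper trains.
        super().train(mode)
        self.backbone.eval()
        return self

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradients flow through the frozen encoder
            feats = self.backbone(images)
        return self.decoder(feats)

# Only decoder parameters would be passed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(model.decoder.parameters(), lr=1e-4)
```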

(2) For the second question, "if you use all the segmentation datasets mentioned":
Following the FC-CLIP paper, we train the open-vocabulary segmentation model only on the COCO dataset and evaluate it zero-shot on the other datasets.


Thank you so much for your quick reply!
May I know how many epochs the model is trained for on the COCO dataset?
Thank you!

I am happy to answer this! The training pipeline is here. We follow FC-CLIP and adopt the same training recipe and losses without any special design. The training batch size is 16, and the model is trained for 50 epochs on the COCO panoptic training set.
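As a back-of-the-envelope check of that recipe (batch size 16, 50 epochs), the sketch below converts epochs into iterations. It assumes the COCO panoptic training set has the same ~118,287 images as COCO train2017; the repo's actual Detectron2-style config may express the schedule in iterations rather than epochs.

```python
# Rough conversion of the stated schedule (50 epochs, batch 16) to iterations.
import math

coco_panoptic_train_images = 118_287  # assumption: COCO train2017 image count
batch_size = 16
epochs = 50

iters_per_epoch = math.ceil(coco_panoptic_train_images / batch_size)
total_iters = iters_per_epoch * epochs
print(f"~{iters_per_epoch} iterations/epoch, ~{total_iters} iterations total")
# -> roughly 7.4k iterations per epoch, about 370k iterations in total
```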

Let me know if you have further questions.