Beckschen / ViTamin

[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"

How much data is used to finetune the model on segmentation task?

goodstudent9 opened this issue · comments

Great work!
This page gives some instructions for preparing data for segmentation:
https://github.com/Beckschen/ViTamin/blob/main/vitamin_fcclip/GETTING_STARTED.md
So your model also needs to be fine-tuned for the segmentation task, right?
I would like to know whether you used all of the segmentation datasets listed in https://github.com/Beckschen/ViTamin/blob/main/vitamin_fcclip/datasets/README.md to fine-tune your model, or only a subset of them, while still showing strong open-vocabulary capability.
Looking forward to your reply!

Thanks for your interest!
(1) For the first question, "So your model also needs to be fine-tuned for the segmentation task":
We transfer our ViTamin models to open-vocabulary segmentation tasks while keeping ViTamin frozen as the CLIP image encoder. Consequently, the ViTamin model remains unchanged during the transfer to the segmentation tasks.
What does "frozen" mean? Please refer to the paper "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP" (https://arxiv.org/pdf/2308.02487.pdf).
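A minimal sketch of this frozen-backbone setup, assuming a PyTorch-style training loop: the pretrained CLIP image encoder receives no gradient updates and only the segmentation decoder is trained. The module names (FrozenCLIPSegmentor, mask_decoder) are hypothetical placeholders, not the actual ViTamin / FC-CLIP API.

```python
# Hypothetical sketch: freeze the CLIP image encoder, train only the decoder.
import torch
import torch.nn as nn

class FrozenCLIPSegmentor(nn.Module):
    def __init__(self, clip_image_encoder: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.backbone = clip_image_encoder
        self.decoder = mask_decoder
        # Freeze every backbone parameter so the CLIP encoder stays unchanged.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.backbone.eval()  # also fixes normalization/dropout behavior

    def train(self, mode: bool = True):
        # Keep the frozen backbone in eval mode even when the wrapper trains.
        super().train(mode)
        self.backbone.eval()
        return self

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradients flow through the frozen encoder
            feats = self.backbone(images)
        return self.decoder(feats)

# Only decoder parameters would be passed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(model.decoder.parameters(), lr=1e-4)
```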

(2) For the second question, "if you use all the segmentation datasets mentioned":
Following the FC-CLIP paper, we train the open-vocabulary segmentation model only on the COCO dataset and evaluate it zero-shot on the other datasets.


Thank you so much for your quick reply!
May I know how many epochs the model is trained for on the COCO dataset?
Thank you!

I am happy to answer this! The training pipeline is here. We follow FC-CLIP and adopt the same training recipe and losses without any special design. The training batch size is 16, and the model is trained for 50 epochs on the COCO panoptic training set.
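As a back-of-the-envelope check of that recipe (batch size 16, 50 epochs), the sketch below converts epochs into iterations. It assumes the COCO panoptic training set has the same ~118,287 images as COCO train2017; the repo's actual Detectron2-style config may express the schedule in iterations rather than epochs.

```python
# Rough conversion of the stated schedule (50 epochs, batch 16) to iterations.
import math

coco_panoptic_train_images = 118_287  # assumption: COCO train2017 image count
batch_size = 16
epochs = 50

iters_per_epoch = math.ceil(coco_panoptic_train_images / batch_size)
total_iters = iters_per_epoch * epochs
print(f"~{iters_per_epoch} iterations/epoch, ~{total_iters} iterations total")
# -> roughly 7.4k iterations per epoch, about 370k iterations in total
```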

Let me know if you have further questions.