BAAI-WuDao / BriVL

Bridging Vision and Language Model

Are the EfficientNet image encoder and BERT text encoder fixed or partially trainable?

phelogges opened this issue · comments

Following the README, some extra pretrained models are required: chinese-roberta-wwm-ext, used as a sub-model of the text encoder, and tf_efficientnet_b5_ns-6f26d0cf.pth, used as a sub-model of the image encoder (according to BriVL-BUA-applications).

In the ImgLearnableEncoder.init_param and TextLearnableEncoder.init_param functions, we noticed conditions that control whether some parameters of these backbones (i.e. the EfficientNet and chinese-roberta-wwm-ext mentioned above) have requires_grad set, in other words whether these parameters are trainable.
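For readers unfamiliar with this pattern, here is a minimal sketch of what such an init_param gate typically does. The module and prefix names below are hypothetical stand-ins for the real BriVL backbones and config conditions, not the actual code:

```python
import torch.nn as nn

# Hypothetical tiny backbone standing in for EfficientNet / chinese-roberta-wwm-ext.
backbone = nn.Sequential(
    nn.Linear(16, 32),  # early layer: keep frozen
    nn.Linear(32, 8),   # late layer: fine-tune
)

def init_param(model, trainable_prefixes=("1.",)):
    """Freeze every backbone parameter except those whose name starts
    with one of the given prefixes (partial fine-tuning)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)

init_param(backbone)
trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
frozen = [n for n, p in backbone.named_parameters() if not p.requires_grad]
print(trainable)  # ['1.weight', '1.bias']
print(frozen)     # ['0.weight', '0.bias']
```

With a gate like this, only the selected sub-model layers receive gradient updates while the rest keep their downloaded pretrained weights.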

And these two encoder classes are instantiated (via eval) in the VL_model class.

This is what confuses me: VL_model is TRAINABLE, which suggests the downloaded official sub-models (EfficientNet and chinese-roberta-wwm-ext) alone are not sufficient and their fine-tuned weights would be required. Is there something wrong here?

I don't know whether I missed some details or misunderstood something.

Looking forward to your reply:)

Closed this issue because the trainable sub-models are packed into the whole BriVL model.
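That resolution makes sense: when sub-modules are registered as attributes of the composite model, a single checkpoint carries their (fine-tuned) weights too, so the standalone downloads are only needed for initialization. A minimal sketch with hypothetical module names (not the actual BriVL class):

```python
import torch.nn as nn

# Hypothetical composite model: sub-models registered as attributes are
# automatically included in the parent's state_dict.
class VLModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_backbone = nn.Linear(4, 4)   # stands in for EfficientNet
        self.text_backbone = nn.Linear(4, 4)  # stands in for RoBERTa

model = VLModel()
keys = sorted(model.state_dict())
print(keys)
# ['img_backbone.bias', 'img_backbone.weight',
#  'text_backbone.bias', 'text_backbone.weight']
```

Saving and loading this one state_dict therefore restores both sub-models without touching the original pretrained files.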