LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701


Why is the projector head MLP set to requires_grad=False?

woshixiaobai2019 opened this issue

I noticed that a new projector head MLP is added after loading the pre-trained MoCo v3 model. However, the parameters of this newly added component are also set to requires_grad=False.

My question is: since this MLP head is randomly initialized, why does it not require any training before being used for feature projection?

Intuitively, adding an untrained random projection head could disrupt the original feature distributions learned by the pre-trained encoder. So what is the motivation behind fixing the parameters of this newly added head?

Is it about better preserving the pre-trained feature distributions, or about leveraging fixed random projections to improve generalization on downstream tasks?

It would be great if someone could explain the rationale behind not training the added projector head. Thanks!

I have the same question. @LTH14, could you please clarify how the newly added projection head of the MoCo v3 model is trained?

The head in moco_vits is inherited from timm's VisionTransformer, which is a single Linear layer. However, the projection head of the pre-trained MoCo v3 model is an MLP (module.base_encoder.head). I didn't want to change the original MoCo code and re-train that model, so I instead replace the Linear head with an MLP head so that the pre-trained weights (including the pre-trained MLP head weights) can be loaded.
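A minimal sketch of what that head swap could look like (the model name, MLP depth, and dimensions below are illustrative assumptions, not the repository's exact code):

```python
import torch.nn as nn
import timm

# timm's VisionTransformer ends in a single nn.Linear head; the MoCo v3
# checkpoint instead stores an MLP under module.base_encoder.head.
# Swapping the Linear head for an MLP with matching parameter names and
# shapes lets the pre-trained head weights be loaded directly.
encoder = timm.create_model("vit_base_patch16_224", num_classes=256)

embed_dim = encoder.head.in_features  # 768 for ViT-Base
mlp_hidden_dim = 4096                 # assumed hidden width of the MoCo v3 MLP
proj_dim = 256                        # assumed projection dimension

# The exact depth and width must match the checkpoint; this small MLP is
# only an illustration of the idea.
encoder.head = nn.Sequential(
    nn.Linear(embed_dim, mlp_hidden_dim),
    nn.BatchNorm1d(mlp_hidden_dim),
    nn.ReLU(inplace=True),
    nn.Linear(mlp_hidden_dim, proj_dim),
)
```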

Note that the random initialization of the pre-trained encoder (and its MLP head) happens before the pre-trained weights are loaded, so the loaded checkpoint overwrites those random values and the head is no longer random when it is frozen; see https://github.com/LTH14/rcg/blob/main/pixel_generator/mage/models_mage.py#L263-L278
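In other words, the construction order matters: the MLP head is only random until the checkpoint is loaded, after which the whole encoder is frozen. A hedged sketch of that sequence (the checkpoint filename and key layout are assumptions based on the standard MoCo v3 checkpoint format, not the repository's exact code):

```python
import torch

# 1. Build the encoder with the MLP head (see the sketch above) -- at this
#    point all of its weights, including the new head, are randomly initialized.

# 2. Load the MoCo v3 checkpoint. The pre-trained weights stored under
#    module.base_encoder.* (head included) overwrite that random initialization.
checkpoint = torch.load("mocov3_vit_base.pth", map_location="cpu")
state_dict = {
    k.replace("module.base_encoder.", ""): v
    for k, v in checkpoint["state_dict"].items()
    if k.startswith("module.base_encoder.")
}
msg = encoder.load_state_dict(state_dict, strict=False)
print(msg.missing_keys)  # ideally empty if the MLP head shapes line up

# 3. Freeze everything. The encoder is used only as a fixed feature extractor,
#    so requires_grad=False simply keeps it out of the optimizer; by this point
#    the MLP head carries pre-trained weights, not random ones.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()
```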