muP for contrastive losses

Question

xwjabc opened this issue 2 years ago

Hi, I have a question regarding the use of muP with contrastive losses. Assume we have an anchor embedding x, a positive embedding x_pos, and a negative embedding x_neg. All of x, x_pos, and x_neg are C-dim vectors, where C is the width, i.e. a dimension that muP treats as going to infinity. The loss L is formulated as:

L = -log( exp(sim(x, x_pos)) / (exp(sim(x, x_pos)) + exp(sim(x, x_neg))) )

where sim(a, b) = cos(a, b) for each embedding pair. It seems that sim() reduces two infinite-dim vectors to a finite scalar, which is similar to the Q K^T operation in self-attention. The difference is that cosine similarity already bounds the output. So I wonder: is there anything we need to change in the loss function when we use muP? Thanks!
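For reference, here is a minimal sketch of the loss above in plain Python. The function names `cosine_sim` and `contrastive_loss` are hypothetical, chosen only for illustration; the thread itself does not include code.

```python
import math

def cosine_sim(a, b):
    """cos(a, b) = a^T b / (||a|| ||b||) for two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(x, x_pos, x_neg):
    """L = -log(exp(s_pos) / (exp(s_pos) + exp(s_neg)))."""
    s_pos = cosine_sim(x, x_pos)
    s_neg = cosine_sim(x, x_neg)
    # Algebraically identical form: log(1 + exp(s_neg - s_pos)).
    # Both similarities lie in [-1, 1], so the exponent cannot overflow.
    return math.log1p(math.exp(s_neg - s_pos))
```

Since sim() is a cosine, rescaling any of the three embeddings by a positive constant leaves the loss unchanged.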

Edward Hu · Answer 1 · Sat Jul 16 2022 00:31:27 GMT+0800 (China Standard Time)

Hi Weijian,

You are right that cosine similarity is okay here. The reason is that sim(x, x') = x^T x' / (||x|| ||x'||). The denominator provides the correct scaling factor, playing the same role as the scaling applied to Q K^T in the attention case.
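The scaling argument can be checked numerically. The sketch below is not from the muP codebase; it assumes embeddings whose coordinates are order 1 (random Gaussian entries here) and that the two vectors are correlated, as they would be after training. Under those assumptions each norm grows like sqrt(C), so the product of norms grows like C, which is the same rate as the raw dot product. The helper name `similarity_stats` is hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def similarity_stats(width, seed=0):
    """Return (raw dot product, dot / width, cosine) for two correlated vectors."""
    rng = random.Random(seed)
    # Assumption: coordinates are order 1, independent of the width.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    # x_other shares a component with x, so the two are correlated.
    x_other = [xi + rng.gauss(0.0, 1.0) for xi in x]
    raw = dot(x, x_other)
    return raw, raw / width, raw / (norm(x) * norm(x_other))

if __name__ == "__main__":
    for width in (64, 256, 1024, 4096):
        raw, scaled, cos = similarity_stats(width)
        print(f"C={width:5d}  dot={raw:9.1f}  dot/C={scaled:.3f}  cos={cos:.3f}")
```

In this setup the raw dot product grows roughly linearly with C, while both dot/C and the cosine stay order 1, so no extra width-dependent factor is needed in front of sim().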

Weijian Xu · Answer 2 · Sat Jul 16 2022 05:08:15 GMT+0800 (China Standard Time)

Gotcha. Thank you for your response!