cgpotts / cs224u

Code for Stanford CS224u

Average of the context vector in lecture "Contextual Word Representation"

lasp73 opened this issue

Thank you for the great course! The lectures and other materials are really valuable for learning more about NLU.

I am not an enrolled student, but I've decided to ask a minor question here about the first lecture on "Contextual Word Representation".

In slide 5 (https://web.stanford.edu/class/cs224u/slides/cs224u-contextualreps-part1-handout.pdf), the "context vector" is computed as $\kappa = \text{mean}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$.

My question: is it really necessary to take the "mean" here rather than the "sum"?

The attention weights $\alpha_n$ come from a softmax, so they already sum to 1. The term $\text{sum}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$ would therefore already be a "weighted average" of the hidden states.
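A minimal numpy sketch of that point, using made-up hidden states and scores (none of this is from the lecture code): with softmax weights, the sum of the weighted vectors is itself a weighted average, and the slide's mean only divides that by the number of states.

```python
import numpy as np

# Hypothetical toy values: three hidden states h_1, h_2, h_3 and their
# raw attention scores h_c^T h_n (made up for illustration).
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
scores = np.array([2.0, 0.5, 1.0])

# Softmax turns the scores into weights alpha_n that sum to 1.
alpha = np.exp(scores) / np.exp(scores).sum()

weighted = alpha[:, None] * h        # [alpha_1*h_1, alpha_2*h_2, alpha_3*h_3]

kappa_sum = weighted.sum(axis=0)     # already a weighted average of the h_n
kappa_mean = weighted.mean(axis=0)   # the slide's version: just kappa_sum / 3

print(kappa_sum)
print(kappa_mean)                    # equals kappa_sum / 3
```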

What I often see instead is scaling the dot products $h_C^\top h_n$ (before the softmax) by $1/\sqrt{d_k}$, where $d_k$ is the vector dimension, to normalize the variance (and get better results), as presented in the paper "Attention Is All You Need".
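For concreteness, here is a small sketch of that scaled variant; the function name `scaled_dot_product_context` and the example inputs are my own, not from the course materials.

```python
import numpy as np

def scaled_dot_product_context(h_c, H, scale=True):
    """Context vector for query h_c over hidden states H (n x d_k).

    If scale is True, the dot products are divided by sqrt(d_k) before
    the softmax, as in "Attention Is All You Need".
    """
    d_k = H.shape[-1]
    scores = H @ h_c                    # h_c^T h_n for each n
    if scale:
        scores = scores / np.sqrt(d_k)  # keeps the score variance roughly constant in d_k
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()         # softmax weights, summing to 1
    return alpha @ H                    # weighted sum of the hidden states

# Hypothetical usage
H = np.random.randn(3, 4)   # h_1, h_2, h_3 with d_k = 4
h_c = np.random.randn(4)
print(scaled_dot_product_context(h_c, H))
```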

Thanks again!

OK, the very next lecture covers the changes above, so I am closing the issue.