cgpotts / cs224u

Code for Stanford CS224u

Average of the context vector in lecture "Contextual Word Representation"

lasp73 opened this issue

Thank you for the great course! The lectures and other materials are really valuable for learning more about NLU.

I am not an enrolled student, but I've decided to ask a minor question here about the first lecture on "Contextual Word Representation".

In slide 5 (https://web.stanford.edu/class/cs224u/slides/cs224u-contextualreps-part1-handout.pdf), the "context vector" is computed as $\kappa = \text{mean}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$.

My question: is it really necessary to take the "mean" here rather than the "sum"?

The attention weights $\alpha_n$ come from a softmax, so they already sum to 1. The term $\text{sum}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$ would therefore already be a "weighted average" of the hidden states.
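A minimal numpy sketch of that point, using made-up hidden states and scores (none of this is from the lecture code): with softmax weights, the sum of the weighted vectors is itself a weighted average, and the slide's mean only divides that by the number of states.

```python
import numpy as np

# Hypothetical toy values: three hidden states h_1, h_2, h_3 and their
# raw attention scores h_c^T h_n (made up for illustration).
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
scores = np.array([2.0, 0.5, 1.0])

# Softmax turns the scores into weights alpha_n that sum to 1.
alpha = np.exp(scores) / np.exp(scores).sum()

weighted = alpha[:, None] * h        # [alpha_1*h_1, alpha_2*h_2, alpha_3*h_3]

kappa_sum = weighted.sum(axis=0)     # already a weighted average of the h_n
kappa_mean = weighted.mean(axis=0)   # the slide's version: just kappa_sum / 3

print(kappa_sum)
print(kappa_mean)                    # equals kappa_sum / 3
```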

What I often see instead is scaling the dot products $h_C^\top h_n$ (before the softmax) by $1/\sqrt{d_k}$, where $d_k$ is the vector dimension, to normalize the variance (and get better results), as presented in the paper "Attention Is All You Need".
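For concreteness, here is a small sketch of that scaled variant; the function name `scaled_dot_product_context` and the example inputs are my own, not from the course materials.

```python
import numpy as np

def scaled_dot_product_context(h_c, H, scale=True):
    """Context vector for query h_c over hidden states H (n x d_k).

    If scale is True, the dot products are divided by sqrt(d_k) before
    the softmax, as in "Attention Is All You Need".
    """
    d_k = H.shape[-1]
    scores = H @ h_c                    # h_c^T h_n for each n
    if scale:
        scores = scores / np.sqrt(d_k)  # keeps the score variance roughly constant in d_k
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()         # softmax weights, summing to 1
    return alpha @ H                    # weighted sum of the hidden states

# Hypothetical usage
H = np.random.randn(3, 4)   # h_1, h_2, h_3 with d_k = 4
h_c = np.random.randn(4)
print(scaled_dot_product_context(h_c, H))
```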

Thanks again!

OK, the very next lecture covers the changes above, so I am closing the issue.