Equation 7 in DeepSeek-V2 Technical Report
whatdhack opened this issue
I am trying to understand Equation 7 in the DeepSeek-V2 tech report. Here are the points I am confused about.
- Are $q_{t,i}$, $k_{t,i}$, and $v_{t,i}$ row vectors, i.e. of shape $(1, d_h)$?
- Is the shape of $q_{t,i}^T k_{j,i}$ then $(d_h, 1) \cdot (1, d_h) = (d_h, d_h)$?
- Does $\mathrm{Softmax}_j$ mean a softmax over the last (-1) dimension?
- Would $q_{t,i}^T k_{j,i} \, v_{j,i}$ then compute as $(d_h, d_h) \cdot (1, d_h)$?
- What is the summation over $j$ for?
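
To make the shape questions concrete, here is a minimal pure-Python sketch of one possible reading. It assumes Equation 7 has the usual causal attention form, o_{t,i} = sum over j = 1..t of Softmax_j(q_{t,i}^T k_{j,i} / sqrt(d_h)) v_{j,i}, and that the vectors are column vectors with d_h entries. Under that assumption q^T k is a scalar per position j rather than a (d_h, d_h) matrix, the softmax runs over positions j, and the sum over j is a weighted average of the value vectors. The function name `attention_output` and its parameter names are hypothetical, chosen only for illustration.

```python
import math

def attention_output(q_t, keys, values):
    """Sketch of one attention head at position t (assumed form of Eq. 7).

    q_t:    list of d_h floats, the query at position t
    keys:   list of t vectors (positions 1..t), each with d_h floats
    values: list of t vectors (positions 1..t), each with d_h floats
    Returns (weights, o_t): t softmax weights and an output of d_h floats.
    """
    d_h = len(q_t)
    # q^T k is an inner product: (1, d_h) x (d_h, 1) -> one scalar per position j
    scores = [sum(q * k for q, k in zip(q_t, k_j)) / math.sqrt(d_h)
              for k_j in keys]
    # Softmax_j normalizes across positions j, not across the feature dimension
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Summation over j: weighted average of the value vectors, d_h entries
    o_t = [sum(w * v_j[d] for w, v_j in zip(weights, values))
           for d in range(d_h)]
    return weights, o_t

# Toy example: d_h = 2, three positions (j = 1, 2, 3)
weights, o_t = attention_output(
    [1.0, 0.0],
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]],
)
print(len(weights), len(o_t))  # 3 weights (one per j), 2 output entries (d_h)
```

If this reading is right, the output for each head keeps the same d_h entries as a single value vector, and no (d_h, d_h) intermediate appears anywhere.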