Questions about Enhanced Speaker
ZhuFengdaaa opened this issue · comments
You describe an enhanced version of the Speaker in Section 3.4.3. However, the geographic information and actions are only used to compute the attention weights over the features. I have difficulty understanding why g, a are not used to compute the context directly. Could you point to related work that motivates this design?
Thanks for pointing it out.
I used a trick called "fused hidden state" when implementing the attention layer here:
Line 122 in 4c11585
Mathematically, it "adds" the information of the query into the retrieved context vector:
c = Att(query, {key})
out = FC([query, c])
Thus, the information of g, a
would be captured by the second LSTM.
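The trick above can be sketched as a small PyTorch module. This is a minimal illustration under assumed shapes and a dot-product scoring function, not the repository's actual implementation; the class name `FusedAttention` and the tanh nonlinearity are hypothetical choices:

```python
import torch
import torch.nn as nn


class FusedAttention(nn.Module):
    """Sketch of the "fused hidden state" trick: attend with the query,
    then pass [query, context] through a fully connected layer so the
    query (which here encodes g, a) also flows into the output."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, query, keys):
        # query: (batch, dim); keys: (batch, n, dim)
        # c = Att(query, {key}) with dot-product scores (an assumption)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (batch, n)
        weights = torch.softmax(scores, dim=1)
        c = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)      # (batch, dim)
        # out = FC([query, c]): the query is fused into the context,
        # so the downstream LSTM sees it even though c alone would not carry it.
        return torch.tanh(self.fc(torch.cat([query, c], dim=1)))
```

The concatenation is the key step: even though attention only uses g, a to weight the features, the FC layer re-injects them into the vector that the second LSTM consumes.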
I am sorry that I forgot to mention this in the paper.