VHellendoorn / ICLR20-Great

Data and Code for Reproducing "Global Relational Models of Source Code"

The formula for computing attention biases in GREAT differs between the code and the paper

jonathan-laurent opened this issue

In the paper, you propose the following formula for computing the attention bias b_{i,j}:

b_{i,j} = dot(W, e) + b, where e is the edge embedding (of dimension N, typically chosen to coincide with the per-head attention dimension), W is a vector of size N, and b is a scalar.

However, it seems to me that the code implements a slightly different formula, namely b_{i,j} = dot(e, b), where b is a vector of size N (self.bias_scalar in the code) and e is the edge embedding (self.bias_embs in the code holds one such embedding per edge type).
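For concreteness, here is a minimal numpy sketch of the two variants as I read them (names follow this thread; the shapes are illustrative, not the repo's actual tensors):

```python
import numpy as np

# Illustrative shapes only; these are not the repo's actual tensors.
N = 64                # per-head attention dimension
num_edge_types = 8

rng = np.random.default_rng(0)
e = rng.normal(size=(num_edge_types, N))   # one edge embedding per edge type (self.bias_embs in the code)

# Paper: b_{i,j} = dot(W, e) + b, with b an edge-type-specific scalar (vectorized over edge types here).
W = rng.normal(size=N)
b = rng.normal(size=num_edge_types)
bias_paper = e @ W + b                     # shape (num_edge_types,)

# Code, as I read it: b_{i,j} = dot(e, bias_scalar), with no additive scalar term.
bias_scalar = rng.normal(size=N)           # self.bias_scalar in the code
bias_code = e @ bias_scalar                # shape (num_edge_types,)
```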

I was wondering whether this change was deliberate or whether it is essentially insignificant.

Hi, thanks for reaching out. You are correct: the difference between the two is the final addition of an edge-type-specific scalar bias, called b_e in the paper, which is absent in the code. Otherwise they are the same, in that both project an N-d edge embedding to a scalar using an N-d weight. Properly, that bias term should be added here as well, which would require another parameter (let's call it bias_bias?) of shape bias_dim. Happy to detail how to add it.
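Roughly, a minimal sketch of one way to add it (TensorFlow, purely illustrative: the names bias_bias and num_edge_types and the exact shapes are not the repo's, and the extra term is treated as one learned scalar per edge type, mirroring the paper's b_e):

```python
import tensorflow as tf

# Illustrative shapes only; not the repo's actual tensor layout.
num_edge_types, bias_dim = 8, 64

bias_embs = tf.Variable(tf.random.normal([num_edge_types, bias_dim]))  # per-edge-type embeddings
bias_scalar = tf.Variable(tf.random.normal([bias_dim]))                # shared projection vector
bias_bias = tf.Variable(tf.zeros([num_edge_types]))                    # hypothetical new parameter: one scalar per edge type

def edge_bias(edge_type):
    # Current released behavior: project the edge embedding down to a scalar.
    e = tf.gather(bias_embs, edge_type)                 # [bias_dim]
    projected = tf.tensordot(e, bias_scalar, axes=1)    # dot(e, bias_scalar)
    # Paper behavior: also add the edge-type-specific scalar b_e.
    return projected + tf.gather(bias_bias, edge_type)
```

Initializing bias_bias to zero keeps the added term inert at the start of training.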

As for relevance, it's difficult to say: the term was present in the original formulation because it makes sense to have a query/key-independent term per edge type, and I certainly see no harm in adding it given that it incurs nearly no extra parameters. At the same time, I've noticed no difference between this public replication, which lacks it, and the Google-internal one that had it, so if it matters, it is not by much (at least not on this task). I don't think I'll add it to the current implementation because it would invalidate the pretrained model checkpoints, but adding it to a new run might help and certainly won't hurt.

-Vincent

Thanks, this is very helpful!