bfshi / TOAST

Official code for "TOAST: Transfer Learning via Attention Steering"


Motivation for top-down input added to value matrix

Kuan-Pang opened this issue · comments

Hi - thanks for the amazing work!

In the second feedforward pass (step iv), the value matrix receives the top-down input to steer the attention map. I was wondering what the motivation is for this design decision, e.g. why does the value matrix specifically receive this signal rather than the query/key matrices?

Hi, thanks for your interest in our work.

This design follows our previous paper on top-down attention (https://arxiv.org/pdf/2303.13043.pdf). The intuition is that Q and K decide which pixels belong to the same object and should be grouped together (reflected in the attention matrix QK^T), while V decides the grouped feature of each object. We keep Q and K the same so that the grouping of each object stays the same (e.g., if there's a cat and a dog in the image, two pixels on the cat belong to the same object and should be grouped together, no matter whether we are looking at the cat or the dog), and we add the top-down feature to V to change/enhance the feature of the specific object we are focusing on.
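The split described above can be sketched as a single-head attention pass where the top-down signal enters only the value path. This is a minimal illustration, not the paper's exact implementation: the additive form `x + td` on the value path and all variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_topdown(x, td, Wq, Wk, Wv):
    """Single-head self-attention for the second feedforward pass.

    `td` is the top-down input; it is added only on the value path,
    so the attention matrix QK^T (the grouping of pixels into objects)
    is identical to the bottom-up pass, while the grouped features
    carried by V are steered toward the attended object.
    """
    Q = x @ Wq
    K = x @ Wk
    V = (x + td) @ Wv  # top-down signal steers V only (assumed additive form)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # grouping unchanged by td
    return A @ V
```

Because Q and K never see `td`, the attention weights are exactly those of the plain pass; only the mixed-in features change.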

Thanks for the reply!