Mask Module Cost Volume Aggregation Method

Question

fengziyue opened this issue 2 years ago · comments

Thank you for sharing the great work!
I'm curious why you chose max pooling to combine the multiple cost volumes in the mask module. Is it only because max pooling works for an arbitrary number of input frames, or do you have further theoretical analysis or assumptions behind it? Have you tried other aggregation methods such as SUM, AVG, CONCAT, etc.?

Thank you again!

Felix Wimbauer · Answer 1 · Fri May 20 2022 23:30:10 GMT+0800 (China Standard Time)

Hi!

We chose max pooling for three reasons:

1. It works with any number of frames.
2. The mask module encodes each single-frame cost volume (CV) as a feature vector. When the frames are inconsistent, the single-frame CVs produce different feature activations. Max pooling keeps these differing activations so that the decoder can pick them up. Sum, average, etc. would reduce the differences, which we don't want.
3. Max pooling has often been used for similar tasks in the literature.
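
To make the second point concrete, here is a minimal toy sketch in plain Python. It is not the repository's code: the feature values, the three-channel layout, and the helper names `aggregate` and `mean` are made up for illustration, and a real implementation would operate on image-shaped tensors rather than short lists. It only shows that a channel-wise max keeps an activation that appears in a single frame, while a mean dilutes it.

```python
def aggregate(frame_features, reduce_fn):
    """Combine per-frame feature vectors channel by channel.

    frame_features: list of equal-length lists, one per frame (hypothetical layout).
    reduce_fn: function applied to all values of one channel, e.g. max.
    """
    return [reduce_fn(channel) for channel in zip(*frame_features)]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Made-up encodings of three single-frame cost volumes.
# Frames 0 and 1 agree; frame 2 is inconsistent and fires on channel 1.
frame_features = [
    [0.8, 0.1, 0.0],
    [0.7, 0.0, 0.1],
    [0.1, 0.9, 0.0],
]

max_pooled = aggregate(frame_features, max)
mean_pooled = aggregate(frame_features, mean)

print("max :", max_pooled)
print("mean:", mean_pooled)

# The same function accepts any number of frames.
two_frame_result = aggregate(frame_features[:2], max)
print("max over two frames:", two_frame_result)
```

In this toy setup the channel that fires in only one frame survives the max unchanged, whereas the mean pulls it toward the values of the other two frames. Whether this carries over to the learned features is the authors' design assumption rather than something the sketch demonstrates.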