Mask Module Cost Volume Aggregation Method
fengziyue opened this issue
Ziyue Feng commented
Thank you for sharing the great work!
I'm curious why you chose max pooling to combine the multiple cost volumes in the mask module. Is it only because max pooling works for an arbitrary number of input frames, or do you have further theoretical analysis or assumptions behind it? Have you tried other aggregation methods such as SUM, AVG, or CONCAT?
Thank you again!
Felix Wimbauer commented
Hi!
We chose max pooling for three reasons:
- It works on any number of frames
- The idea is that the mask module encodes each single-frame cost volume (CV) into a feature vector. When the frames are inconsistent, the single-frame CVs produce different feature activations. Max pooling keeps those distinct activations so that the decoder can pick them up. Sum, Avg, etc. would reduce the differences, which we don't want.
- Max Pooling has often been used for similar tasks in the literature.
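
To make the first two points concrete, here is a toy sketch in plain Python. It is not the actual mask module code: the function names `max_pool` and `mean_pool`, the three-channel feature vectors, and the activation values are all made up for illustration. It only shows that element-wise max pooling returns a vector of the same size for any number of frames, and that a strong activation from a single inconsistent frame survives the max but is diluted by the mean.

```python
# Toy illustration only: plain lists stand in for per-frame feature vectors.
# All names and values here are hypothetical, not taken from the repository.

def max_pool(features):
    """Element-wise maximum over a list of per-frame feature vectors."""
    return [max(channel) for channel in zip(*features)]


def mean_pool(features):
    """Element-wise mean over a list of per-frame feature vectors."""
    return [sum(channel) / len(channel) for channel in zip(*features)]


# Three single-frame feature vectors. The first two agree with each other;
# the third is "inconsistent" and fires strongly in the last channel.
frames = [
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.1],
    [0.1, 0.2, 0.9],
]

pooled_max = max_pool(frames)    # keeps the 0.9 from the inconsistent frame
pooled_mean = mean_pool(frames)  # averages that activation down

print("max :", pooled_max)
print("mean:", pooled_mean)

# The output size depends only on the channel count, not the frame count.
print(len(max_pool(frames[:2])), len(max_pool(frames)))
```

In this example the last channel stays at 0.9 after max pooling, while averaging pulls it down to roughly a third of that, so the signal a decoder could use to detect the inconsistency is much weaker.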