PeterL1n / BackgroundMattingV2

Real-Time High-Resolution Background Matting

Why is foreground prediction necessary?

max810 opened this issue

Hello,
First of all, good job with the paper! Nicely written and explains a lot of concepts pretty well.
However, I am still a little puzzled about why predicting the foreground (or, in this case, the foreground residual) is necessary in the pipeline.
Consider this example from the demo:
[image: example frame from the demo]

For the composition (the final step) - why do we use the pixels from the upsampled foreground and not from the original image? They should be identical anyway, because we explicitly train the coarse foreground prediction to replicate the pixels of the original image inside the alpha mask region (formula 2):
[image: formula 2 from the paper]

A possible answer is mentioned in Issue#19, but it's unclear to me what the "background color spill onto partial-opacity hairs and edges" looks like and how the foreground prediction branch mitigates this issue.

I would greatly appreciate an explanation and/or just a side-by-side comparison of 2 images (original vs predicted foreground).

Thank you in advance!

The foreground is only equal to the source on regions where alpha = 1. But for semitransparent regions, it is not, because part of the original background will leak through. These regions are usually hair, silhouette, and motion blur.
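To make the spill concrete, here is a minimal compositing sketch (variable names like `fgr`, `pha`, `new_bgr` are illustrative, not the repo's exact API). Compositing with the predicted foreground keeps the old background out, whereas compositing with the original image re-blends a fraction of it wherever 0 < alpha < 1:

```python
import torch

def composite(fgr, pha, new_bgr):
    # fgr, new_bgr: [B, 3, H, W] float tensors in [0, 1]
    # pha:          [B, 1, H, W] float tensor  in [0, 1]
    return pha * fgr + (1 - pha) * new_bgr

# If we used the original image I instead of the predicted foreground F',
# then since I = a*F + (1 - a)*B_old, the composite
#   pha * I + (1 - pha) * new_bgr
# still contains roughly a * (1 - a) * B_old wherever 0 < a < 1
# (hair, edges, motion blur) -- that is the background color spill.
```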

But won't the foreground be learned to match the original pixels for all alpha > 0, not just alpha = 1?

No, it won't. The dataset provides a ground-truth foreground F and alpha a. We composite them onto a background to synthesize the source input I = aF + (1 - a)B. The model predicts a foreground F' and an alpha a'. The loss on F' is computed against the ground-truth F, not against I.
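Spelled out in code, this is roughly the following (a minimal sketch with illustrative names, not the repository's actual training script):

```python
import torch

def synthesize_input(true_fgr, true_pha, bgr):
    # I = a*F + (1 - a)*B : the composited source the network actually sees.
    return true_pha * true_fgr + (1 - true_pha) * bgr

def foreground_loss(pred_fgr, true_fgr, true_pha):
    # L1 loss on the predicted foreground, restricted to alpha > 0 regions,
    # against the dataset's extracted foreground (background already removed),
    # not against the composited source I.
    mask = (true_pha > 0).float()
    return torch.abs((pred_fgr - true_fgr) * mask).mean()
```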

Oh, so we use the properly extracted foregrounds from the datasets, and the model directly learns to remove the background in those situations you described (hair strands, motion blur, etc.). I missed that, sorry.

Thanks for the explanation!