lmb-freiburg / Multimodal-Future-Prediction

The official repository for the CVPR 2019 paper "Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction"

Figure 3 in the paper

droneRL2020 opened this issue

Hello, I have questions about the inputs and labels for Figure 3 (in the paper) during training.

Can you please elaborate on the training process for EWTA, including the inputs and the ground-truth labels?

Figure 3 mentions 3 ground truths, but what are they?
In my experience training an MDN, my training process was as follows (see the sketch after this list):

  • label: one ground-truth path (X, Y)
  • input: object-detection images (rasterized, shape (16, 25, 300, 300) = (batch, channels, height, width))
  • model(input) outputs 3 modes (pi, mean, sigma for each of the 3 modes)
  • MDN loss function
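
For reference, my MDN loss was essentially the standard mixture negative log-likelihood. A minimal sketch of what I mean (PyTorch, with assumed shapes; `mdn_nll` is just an illustrative name):

```python
import torch

def mdn_nll(pi, mu, sigma, gt):
    # pi: (B, M) mixture weights, mu: (B, M, 2) means,
    # sigma: (B, M, 2) standard deviations, gt: (B, 2) ground-truth point
    diff = gt.unsqueeze(1) - mu                          # (B, M, 2)
    log_prob = (-0.5 * ((diff / sigma) ** 2).sum(-1)
                - torch.log(sigma).sum(-1))              # per-mode log-likelihood (constant dropped)
    # negative log of the weighted mixture, via log-sum-exp for stability
    return -torch.logsumexp(torch.log(pi) + log_prob, dim=1).mean()
```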

I assume you started with 8 modes in your paper (the 8 black dots), each containing pi, mean (and sigma for EWTAD).
Am I on the right track?

Hi,
Just to clarify: Figure 3 is only used to explain how the optimization of our loss function works.

The three ground truths are not available at the same time. You will see only one ground truth at each iteration, but over the course of training you will see all of them.

Figure 3 explains only the sampling framework, where we use EWTA. For simplicity, you can assume the simpler version of our approach: the sampling network generates multiple hypotheses (a set of points (x, y)), and during fitting you fit those hypotheses into your final mixture model.

In practice, we train the sampling network to generate 20 hypotheses and then fit them into 4 modes (as mentioned in section 6.1).
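
The actual fitting in our approach is done by a fitting network trained with the NLL loss below. Purely as a mental model, though, the fitting step is akin to fitting a small Gaussian mixture to the hypothesis set, for example with a few EM iterations. The following is only an illustrative sketch (PyTorch; `fit_mixture_em` and all shapes are assumptions, not the repository's implementation):

```python
import torch

def fit_mixture_em(hyps, num_modes=4, iters=10):
    # Fit a diagonal Gaussian mixture to a set of hypotheses with a few EM
    # steps. hyps: (B, K, 2); returns means (B, M, 2), sigmas (B, M, 2),
    # and mixture weights pi (B, M). Illustrative only.
    B, K, _ = hyps.shape
    means = hyps[:, :num_modes, :].clone()              # init from first M hypotheses
    sigmas = torch.ones(B, num_modes, 2)
    pi = torch.full((B, num_modes), 1.0 / num_modes)
    for _ in range(iters):
        # E-step: responsibility of each mode for each hypothesis
        diff = hyps.unsqueeze(2) - means.unsqueeze(1)   # (B, K, M, 2)
        log_p = (-0.5 * ((diff / sigmas.unsqueeze(1)) ** 2).sum(-1)
                 - torch.log(sigmas).sum(-1).unsqueeze(1))  # (B, K, M)
        r = torch.softmax(torch.log(pi).unsqueeze(1) + log_p, dim=2)
        # M-step: re-estimate weights, means and (diagonal) sigmas
        nk = r.sum(1) + 1e-8                            # (B, M)
        pi = nk / K
        means = (r.unsqueeze(-1) * hyps.unsqueeze(2)).sum(1) / nk.unsqueeze(-1)
        diff = hyps.unsqueeze(2) - means.unsqueeze(1)
        sigmas = ((r.unsqueeze(-1) * diff ** 2).sum(1) / nk.unsqueeze(-1) + 1e-6).sqrt()
    return means, sigmas, pi
```

With hyps of shape (B, 20, 2) and num_modes=4, this mirrors the 20-hypotheses-to-4-modes reduction mentioned above.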

To get an idea about the EWTA loss implementation, we have already provided the code for the loss function:

```python
def make_sampling_loss(self, hyps, bounded_log_sigmas, gt, mode='epe', top_n=1):
```
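
In spirit, the EWTA objective penalizes only the top_n hypotheses closest to the ground truth (with top_n=1 this reduces to classic winner-takes-all, and top_n is relaxed over training). A minimal sketch of that idea, assuming PyTorch and (B, K, 2) hypotheses (`ewta_loss` is an illustrative name, not the repository function):

```python
import torch

def ewta_loss(hyps, gt, top_n=1):
    # hyps: (B, K, 2) hypotheses from the sampling network, gt: (B, 2)
    dists = torch.norm(hyps - gt.unsqueeze(1), dim=-1)   # (B, K) endpoint errors (EPE)
    # penalize only the top_n hypotheses closest to the ground truth;
    # decreasing top_n towards 1 over training gives the "evolving" schedule
    winners, _ = torch.topk(dists, k=top_n, dim=1, largest=False)
    return winners.mean()
```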

We also provided the loss function used in the fitting network (NLL) at:

```python
def make_fitting_loss(self, means, bounded_log_sigmas, mixture_weights, gt):
```
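
Conceptually, the fitting loss is the negative log-likelihood of the single ground truth under the predicted mixture. A rough sketch with the same argument names (PyTorch; the actual implementation, e.g. how the log-sigmas are bounded, may differ):

```python
import torch

def fitting_nll(means, bounded_log_sigmas, mixture_weights, gt):
    # means: (B, M, 2), bounded_log_sigmas: (B, M, 2),
    # mixture_weights: (B, M), gt: (B, 2)
    diff = gt.unsqueeze(1) - means                       # (B, M, 2)
    inv_var = torch.exp(-2.0 * bounded_log_sigmas)       # 1 / sigma^2
    log_prob = -0.5 * (diff ** 2 * inv_var).sum(-1) - bounded_log_sigmas.sum(-1)
    # NLL of the ground truth under the mixture, via log-sum-exp
    return -torch.logsumexp(torch.log(mixture_weights) + log_prob, dim=1).mean()
```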

Feel free to raise more questions if you still need help.

Thank you for your explanation!

To confirm, can you please elaborate on this sentence?
"The three ground truths are not available at the same time. You will see only one ground truth at each iteration, but over the course of training you will see all of them."

Does it mean that we use 3 ground-truth labels (3 future trajectories, i.e. (x, y) positions in the image) paired with one image during training?

No, we use only one ground truth. Every training sample has an input (e.g., an image) and a single ground truth. We generate multiple hypotheses (e.g., 8 or 20) and use the EWTA loss function (make_sampling_loss() in our repository), which takes a set of hypotheses (hyps) and a single ground truth (gt).

What we mean with Figure 3 is that during training, in one iteration the network sees an image with its single ground truth, and in another iteration (maybe much later) it sees a similar input image with a different ground truth. The EWTA loss will encourage the network to use one head in the first case and another head in the latter case.
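
As a toy illustration of this head specialization (made-up numbers, reusing the `ewta_loss` sketch from above): with top_n=1, each ground truth only produces a gradient for its closest hypothesis, so different ground truths update different heads:

```python
import torch

# two heads at made-up positions; two ground truths near different heads
hyps = torch.tensor([[[0.0, 0.0], [5.0, 5.0]]], requires_grad=True)  # (1, 2, 2)
for gt in (torch.tensor([[0.5, 0.0]]), torch.tensor([[5.0, 4.5]])):
    ewta_loss(hyps, gt, top_n=1).backward()
# hyps.grad now has non-zero rows for both heads: the first gt only
# touched head 0, the second gt only touched head 1.
print(hyps.grad)
```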

Thank you for the explanation!