lmb-freiburg / Multimodal-Future-Prediction

The official repository for the CVPR 2019 paper "Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction"


Questions about sampling network training process

droneRL2020 opened this issue · comments

Are these lines fc7 and fc8 in Table 8 of the paper?
https://github.com/lmb-freiburg/Multimodal-Future-Prediction/blob/master/encoder.py#L92-L97

Second question: in Steps 2 and 3 below, should I use the NLL loss during training?
Step 1. Train the sampling network (fitting network frozen)
Step 2. Train the fitting network (sampling network frozen)
Step 3. Train both together

Third question: how many iterations do you recommend for Step 2 and Step 3 if I trained Step 1 for 30,000*5 iterations?

Fourth question: when I trained the fitting network, the loss sometimes went negative. Was that the same for you? Do you recommend the tanh activation function? (It seems to be in the code but not in the paper.)

Question 1: yes.

Question 2: yes.

Training the first stage is the most important part and requires most of the training time. In your case, you could train step 2 for 30k and step 3 for 10k iterations. The best approach is to watch the training and validation loss behavior. Note that you need to lower the learning rate when training step 3.
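
A rough sketch of how such a staged schedule could look in TF1-style code (the variable scope names 'sampling' and 'fitting' and the loss names are assumptions for illustration, not the repo's actual identifiers):

    import tensorflow as tf  # TF1-style graph API

    # Hypothetical scopes; pick out the variables of each sub-network.
    sampling_vars = [v for v in tf.trainable_variables() if v.name.startswith('sampling/')]
    fitting_vars = [v for v in tf.trainable_variables() if v.name.startswith('fitting/')]

    # ewta_loss and nll_loss are assumed to be built elsewhere in the graph.
    # Step 1: train only the sampling network with the EWTA loss.
    step1_op = tf.train.AdamOptimizer(1e-4).minimize(ewta_loss, var_list=sampling_vars)
    # Step 2: freeze the sampling network, train the fitting network with the NLL loss.
    step2_op = tf.train.AdamOptimizer(1e-4).minimize(nll_loss, var_list=fitting_vars)
    # Step 3: fine-tune everything together with a lower learning rate.
    step3_op = tf.train.AdamOptimizer(1e-5).minimize(nll_loss)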

You mean the tanh function on top of the first layer of the fitting network? Yes, I would use it.
I do not think I observed a negative loss.

Thanks for your answers! I have more questions.

First, do you recommend replacing the sampling network with a pretrained network like ResNet-18? If not, is there a particular reason?

Second question: can I use batch normalization and ReLU instead of tanh and dropout in the fitting network? If not, is there a particular reason?

Third question: if I set the sampling network's number of hypotheses to 40, does that mean 40 * 3 modes or 40 modes?

Last question: after training the sampling network for 10,000*5 iterations, the loss goes down to around 17, but when the fitting network starts training, the loss jumps to a very high value (around 8000) and does not decrease. Do you think I should train the sampling network more?

Hi,

  • In our task, using a pretrained ResNet would not help because the inputs are not single images. If your task could benefit from such a pretrained model, then yes, you can use it.

  • Of course, using batch normalization and ReLU will work, but I am not sure whether it brings improvements. You can try it and share your conclusion with us.

  • By setting the number of hypotheses to 40, you are generating 40 different outputs in the first (sampling) network. The final number of modes of your mixture model is then determined by the fitting network, specifically by the num_output you set at:

    predicted = tf_full_conn(intermediate_drop, name='predict_fc1', num_output=20 * 4)

    For example, if your sampling network generates 40 hypotheses and you want to fit them into 4 modes, then you set num_output=40*4

  • The magnitude of the NLL loss (used in the fitting) differs from that of the sampling loss (EWTA), so it is fine that they are not comparable. However, a very high NLL value does not make sense. Also, the fitting network's loss will only decrease at the beginning of training and then stay stable, because training the fitting network while the sampling network is fixed is not a hard task. Double-check the assignment outputs of the fitting network to see if they are meaningful.

Thank you for your explanation.
I want to confirm the difference between modes and hypotheses in your paper.
The two pictures below are one set of results from your test.py code.
In the first picture, there seem to be 20 hypotheses * 2 outputs (x, y coordinates).
In the second picture I can see 4 distributions (which look like 4 modes).
However, the sampling network output shape seems to be (batch, 20*4), as below.

input_2 = tf.concat([hyps_concat, log_scales_concat], axis=1) # (batch, 20*4, 1, 1)

Also, the fitting network output shape seems to be the same as the sampling network output shape, (batch, 20*4):
predicted = tf_full_conn(intermediate_drop, name='predict_fc1', num_output=20 * 4)

Originally, I understood it as follows:
sampling network output shape: 20 hypotheses * 2 outputs (x, y) --> the 20 green boxes in the first picture
fitting network output shape: 4 modes (4 distributions built from the 20 hypotheses) * 2 outputs (x, y) --> 4 sets of 2D (x, y) distributions

Can you please let me know which part I am missing?

[Attached images: "0-hyps" hypotheses visualization and the fitted mixture distribution]

Hi,

The sampling network outputs 20 hypotheses, each being a unimodal distribution (2 values for the mean and 2 for the sigma), hence the shape (20*4). This is what we call EWTAD in the paper. Note that the first picture shows only the means: we draw a set of bounding boxes of the same size, centered at the predicted means.

The fitting network outputs assignment vectors of shape (20*4) (referred to as z_k in Eq. 6 of the paper). It takes as input the set of hypotheses generated by the sampling network and outputs, for each hypothesis, an assignment vector of shape (4) (the number of modes). In other words, the fitting network computes the assignment of each hypothesis to the final modes. For example, if the first hypothesis should be assigned to the third mode, then its assignment vector will be (0, 0, 1, 0), and so on. Note that these assignment vectors have entries between 0 and 1 that sum to 1.
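
A minimal sketch of how such soft assignments could be produced from the fitting network's last layer (the reshape and softmax below are an illustration, not necessarily the repo's exact implementation):

    import tensorflow as tf

    num_hyps, num_modes = 20, 4
    # 'predicted' is assumed to be the (batch, num_hyps*num_modes, 1, 1) output
    # of the fitting network's last fully connected layer.
    assignments = tf.reshape(predicted, [-1, num_hyps, num_modes])  # (batch, 20, 4)
    # Softmax over the mode axis makes each hypothesis's assignment vector
    # non-negative and sum to 1, as described above.
    assignments = tf.nn.softmax(assignments, axis=-1)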

Then the function tf_assemble_lmm_parameters_independent_dists:

means, bounded_log_sigmas, mixture_weights = tf_assemble_lmm_parameters_independent_dists(samples_means=out_hyps,

takes these assignments and the input hypotheses (a set of independent unimodal distributions in the form of means and sigmas) and outputs the final multimodal mixture model.
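
For intuition, a rough sketch of how soft assignments and hypotheses could be combined into mixture parameters (an illustration of the general idea only, not the actual tf_assemble_lmm_parameters_independent_dists implementation; all shapes are assumptions):

    import tensorflow as tf

    # Assumed shapes for illustration:
    #   hyp_means:   (batch, num_hyps, 2)          means of the unimodal hypotheses
    #   assignments: (batch, num_hyps, num_modes)  soft assignments, sum to 1 over modes
    soft_counts = tf.reduce_sum(assignments, axis=1)                      # (batch, num_modes)
    mixture_weights = soft_counts / tf.reduce_sum(soft_counts, axis=1, keepdims=True)

    # Mode means as the assignment-weighted average of the hypothesis means;
    # the sigmas would be combined in an analogous weighted fashion.
    w = tf.expand_dims(assignments, -1)                                   # (batch, num_hyps, num_modes, 1)
    mode_means = tf.reduce_sum(w * tf.expand_dims(hyp_means, 2), axis=1)  # (batch, num_modes, 2)
    mode_means = mode_means / tf.expand_dims(soft_counts, -1)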

Thank you for your reply. Also, I really appreciate your quick response!

I tried to understand your code and came up with more questions below.

  1. Here, when you use tf.fill(), shouldn't we pass a shape as the first parameter? You used diff2, which does not seem to be a shape:

    diff2 = tf.add(diff2, tf.fill(diff2, eps))

  2. Shouldn't this be means[i] instead of hyps[i]?

    diff2 = tf.square(gt - hyps[i]) # (batch,2,1,1)

  3. What is nd.ops.mul? I could not find that operation.

    sxsy = nd.ops.mul(sigma[:, 0:1, :, :], sigma[:, 1:2, :, :])

  4. Should I use "out_hyps" and "out_log_sigmas" as the inputs to make_sampling_loss?
    In net.py there is a bounded_log_sigmas that comes out after the fitting network, and the sampling loss function takes bounded_log_sigmas, so I was not sure which tensors are the correct inputs for the sampling loss function.

out_hyps, out_log_sigmas = self.disassembling(output)

def make_sampling_loss(self, hyps, bounded_log_sigmas, gt, mode='epe', top_n=1):

  5. I keep getting a negative loss. As "b" gets smaller, the loss becomes negative. Do you think this is fine?
    The graph below shows the training steps from "iul" (5000 iterations) to the fitting network (5000 iterations).
    Before this, I trained the sampling network (5 stages * 5000 = 25000 iterations).
    [Attached image: training loss curve]

Hi @droneRL2020

Sorry for the late response; we were quite busy with a deadline.

1- I think tf.fill can also take a tensor as its first argument and will only use its dimensions (shape). Of course, you can directly use the shape.
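
For reference, a small sketch of the same addition written with an explicit shape (eps is assumed to be a small scalar constant):

    import tensorflow as tf

    eps = 1e-6  # small constant, value assumed for illustration
    # Fill a tensor with the same shape as diff2 and add it elementwise,
    diff2 = tf.add(diff2, tf.fill(tf.shape(diff2), eps))
    # which is equivalent to the simpler scalar broadcast: diff2 = diff2 + eps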

2- Yes, you are right. I will update it accordingly.

3- You can simply replace it with tf.multiply. I will update it as well.
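
In other words, the line would simply become (a one-line sketch using the standard TensorFlow op):

    # Elementwise product of the two sigma channels.
    sxsy = tf.multiply(sigma[:, 0:1, :, :], sigma[:, 1:2, :, :])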

4- Yes, the output of self.disassembling(output) is out_hyps and out_log_sigmas, where out_log_sigmas are already the bounded_log_sigmas (see

bounded_log_sigmas = [tf_adjusted_sigmoid(log_sigmas[i], -6, 6) for i in range(len(log_sigmas))]
).
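
For intuition, tf_adjusted_sigmoid rescales a sigmoid into a given range; a plausible sketch of such a function (not necessarily the repo's exact implementation) is:

    import tensorflow as tf

    def adjusted_sigmoid(x, lower, upper):
        # Squash x into (lower, upper) instead of the usual (0, 1).
        return lower + (upper - lower) * tf.sigmoid(x)

    # Bounding the predicted log-sigmas to (-6, 6) keeps the variances in a
    # numerically safe range when evaluating the NLL.
    log_sigma = tf.placeholder(tf.float32, [None, 2])  # example predicted log-sigmas
    bounded_log_sigma = adjusted_sigmoid(log_sigma, -6.0, 6.0)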

5- In my experiments, I did not observe negative loss values, but I think it is not wrong to have them. Just check the final mixture model parameters and see if they make sense (e.g., plot the mixture model distribution).
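
As a side note, the NLL of a continuous density can legitimately become negative once the predicted density exceeds 1 (for example, with small sigmas); a tiny sketch illustrating this:

    import numpy as np

    # NLL of a 1-D Gaussian evaluated at its own mean: -log N(mu | mu, sigma^2)
    # = 0.5 * log(2*pi*sigma^2). For small enough sigma this is negative.
    def gaussian_nll_at_mean(sigma):
        return 0.5 * np.log(2.0 * np.pi * sigma ** 2)

    print(gaussian_nll_at_mean(1.0))  # ~0.92  (positive)
    print(gaussian_nll_at_mean(0.1))  # ~-1.38 (negative, still a valid NLL)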