andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/

How are the video frames combined?

tuffr5 opened this issue

Thanks for sharing your great work with us. But I have a question: it is somewhat opaque in the code how you handle the multiple video frames. Do you simply tile all the frames together and then feed them into the "img_net"?
Waiting for your reply, thanks so much.

From the 'sep_example.tf' file provided by the author, it can be seen that the video frames are concatenated vertically and then stored in the .tf files.
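For illustration, here is a minimal sketch of what "concatenated vertically" means. It is not the author's preprocessing code, and the clip length and frame size below are made-up placeholders:

import numpy as np

# A clip of T frames, each H x W x 3, is stacked along the height axis,
# so the whole clip is stored as one tall image of shape (T*H, W, 3).
T, H, W = 63, 224, 224                     # hypothetical clip length and frame size
frames = np.zeros((T, H, W, 3), np.uint8)  # placeholder video frames
tall_image = frames.reshape(T * H, W, 3)   # vertical concatenation of the frames
print(tall_image.shape)                    # (14112, 224, 3)

Presumably the data loader reshapes this tall image back into individual frames before feeding the "img_net", but check the loading code to be sure.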

Thanks for your kind answer. But I have no idea where sep_example.tf is. Can you please tell me?

You can get it here: #11 (comment).

Thanks so much.

Hi, do you know what the labels look like? Since the paper says there is no human labeling, I wonder what the labels are.

You can look it up in the code, in 'shift_net.py'.

OK, thank you. Actually, I was confused by the code, which is why I was asking. The code is as follows:
"labels = tf.random_uniform([shape(ims, 0)], 0, 2, dtype=tf.int64, name='labels_sample')
samples0 = tf.where(tf.equal(labels, 1), samples_ex[:, 1], samples_ex[:, 0])
samples1 = tf.where(tf.equal(labels, 0), samples_ex[:, 1], samples_ex[:, 0])
labels1 = 1 - labels

net0 = make_net(ims, samples0, pr, reuse=reuse, train=self.is_training)
net1 = make_net(None, samples1, pr, im_net=net0.im_net, reuse=True, train=self.is_training)
labels = tf.concat([labels, labels1], 0)"

  1. The labels are generated randomly; why is that?
  2. Why are there two nets, net0 and net1? Do they have any relationship? I can't see it from the paper.

Thanks so much.

Take a look at #16.
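In case it helps, here is a rough numpy sketch of my reading of that snippet, assuming samples_ex stacks two audio tracks per example (one aligned with the video and one shifted). The random bit only decides which of the two tracks goes through net0; net1 always gets the other one, so every example yields one aligned and one shifted pair without any human labeling:

import numpy as np

batch = 4
# column 0 and column 1 hold the two audio tracks for each example
samples_ex = np.stack([np.zeros(batch), np.ones(batch)], axis=1)
labels = np.random.randint(0, 2, size=batch)       # random 0/1 bit per example

samples0 = np.where(labels == 1, samples_ex[:, 1], samples_ex[:, 0])  # track fed to net0
samples1 = np.where(labels == 0, samples_ex[:, 1], samples_ex[:, 0])  # the other track, fed to net1
labels_all = np.concatenate([labels, 1 - labels])  # 2*batch labels for the 2*batch pairs

print(labels_all.mean())  # always 0.5: the two classes are perfectly balanced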

Thank you. But it is still not clear, right?