andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/

Question about the shift_net.py training

xiaoyiming opened this issue · comments

It's really nice work. However, I ran into some questions while reading shift_net.py, about the code below:
```python
ims = self.inputs[i]['ims']
samples_ex = self.inputs[i]['samples']
assert pr.both_examples
assert not pr.small_augment
labels = tf.random_uniform(
  [shape(ims, 0)], 0, 2, dtype = tf.int64, name = 'labels_sample')
samples0 = tf.where(tf.equal(labels, 1), samples_ex[:, 1], samples_ex[:, 0])
samples1 = tf.where(tf.equal(labels, 0), samples_ex[:, 1], samples_ex[:, 0])
labels1 = 1 - labels

net0 = make_net(ims, samples0, pr, reuse = reuse, train = self.is_training)
net1 = make_net(None, samples1, pr, im_net = net0.im_net, reuse = True, train = self.is_training)
labels = tf.concat([labels, labels1], 0)
```

My understanding is that samples_ex is the stereo audio with size batch_size × N × 2 (where N is the length of the audio signal). However, why are the labels random? Shouldn't they be constant (with 0 meaning not synchronized and 1 meaning synchronized)? I'm looking forward to your reply.

Yes, it probably would have made more sense to make the labels constant. I did it this way so that each GPU's mini-batch had an equal number of shifted and non-shifted examples, and so that every example appears twice (once shifted and once non-shifted). I don't think this was necessary, though.
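To make that balancing argument concrete, here is a small self-contained sketch (TF 1.x style, matching the snippet above) of the same label trick on dummy data. The layout of samples_ex (the two audio tracks per example stacked along axis 1, one aligned and one shifted) is an assumption here and would need to be checked against the repo's data loader.

```python
import tensorflow as tf

batch_size = 4
# Dummy stand-in: two audio tracks per example (assumed: one aligned, one shifted).
samples_ex = tf.random_normal([batch_size, 2, 1000])

# One random 0/1 label per example decides which track the first tower sees ...
labels = tf.random_uniform([batch_size], 0, 2, dtype=tf.int64)
samples0 = tf.where(tf.equal(labels, 1), samples_ex[:, 1], samples_ex[:, 0])
# ... and the second tower gets the other track, with the complementary label.
samples1 = tf.where(tf.equal(labels, 0), samples_ex[:, 1], samples_ex[:, 0])

# Concatenating labels with their complement always gives a batch that is
# exactly half 0s and half 1s, and every example is used once with each track.
labels_all = tf.concat([labels, 1 - labels], 0)

with tf.Session() as sess:
    print(sess.run(labels_all))  # e.g. [1 0 1 1 0 1 0 0]
```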

@andrewowens Thanks for the reply! However, I have a few more questions. In shift_dest.py:

```python
feats['im_0'] = tf.FixedLenFeature([], dtype=tf.string)
feats['im_1'] = tf.FixedLenFeature([], dtype=tf.string)
```

1. What is stored in 'im_0' and 'im_1'?
2. Is it the output of tf.gfile.FastGFile?
3. Does 'im_0' contain the first half of the video's frames and 'im_1' the second half?
4. If 3 is true, why divide a video into two parts?

I'm looking forward to your reply.

I have the same question. Looking forward to sample code for generating the shift dataset.
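In the meantime, here is a hypothetical sketch of how one could serialize an example with string features named 'im_0' and 'im_1'. This is not the repo's actual preprocessing: it only assumes each feature holds a raw byte string (e.g. packed uint8 frames), which is consistent with the FixedLenFeature([], dtype=tf.string) spec quoted above; the 'samples' feature name, shapes, and dtypes are likewise assumptions.

```python
import numpy as np
import tensorflow as tf

def bytes_feature(value):
    # Wrap a raw byte string as a TFRecord bytes feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_example(path, frames_0, frames_1, samples):
    # frames_0 / frames_1: uint8 frame arrays; samples: float32 audio.
    feats = {
        'im_0': bytes_feature(frames_0.tobytes()),
        'im_1': bytes_feature(frames_1.tobytes()),
        'samples': bytes_feature(samples.astype(np.float32).tobytes()),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feats))
    with tf.python_io.TFRecordWriter(path) as writer:
        writer.write(example.SerializeToString())

# Dummy data: two blocks of frames from one clip plus stereo audio (assumed layout).
write_example('example.tfrecord',
              np.zeros([32, 224, 224, 3], np.uint8),
              np.zeros([32, 224, 224, 3], np.uint8),
              np.zeros([44100, 2], np.float32))
```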