andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/

model architecture

riyaj8888 opened this issue

Can anyone briefly explain how the audio and video features are fused together?

[figure: "avts" — fusion architecture diagram from the original paper]

Please use the above image as a reference; it is from the original paper.

we apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations.

I don't understand this part of the paper; please help me with it.
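
If it helps to see the tensor shapes, here is a minimal PyTorch sketch of that fusion step. It is not the repo's actual code (the repo uses TensorFlow), and the channel counts, strides, kernel sizes, and input lengths below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

B = 2              # batch size
T_in = 64          # input video frames
H, W = 28, 28      # spatial size of the video frames / feature map
N_samples = 16384  # waveform samples covering the same clip (assumed)

# Video subnetwork: a few 3D convs whose temporal stride cuts the
# frame rate by a factor of 4, as described in the paper.
video_net = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(2, 1, 1), padding=1),   # T -> T/2
    nn.ReLU(),
    nn.Conv3d(64, 64, kernel_size=3, stride=(2, 1, 1), padding=1),  # T/2 -> T/4
    nn.ReLU(),
)

# Audio subnetwork: strided 1D convs on the raw waveform until its temporal
# rate matches the video features (T_in / 4 = 16 steps for this toy clip).
# 16384 samples / 4^5 = 16, so five stride-4 convolutions line up exactly.
audio_net = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=15, stride=4, padding=7), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=15, stride=4, padding=7), nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=15, stride=4, padding=7), nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=15, stride=4, padding=7), nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=15, stride=4, padding=7), nn.ReLU(),
)

video = torch.randn(B, 3, T_in, H, W)     # (batch, channels, time, H, W)
waveform = torch.randn(B, 1, N_samples)   # (batch, channels, samples)

v_feat = video_net(video)      # (B, 64, 16, 28, 28)
a_feat = audio_net(waveform)   # (B, 128, 16) -- no spatial dimensions

# Fusion: give the audio features singleton spatial dims, tile them over
# every (H, W) location, then concatenate with video along the channel axis.
a_tiled = a_feat[:, :, :, None, None].expand(-1, -1, -1, H, W)  # (B, 128, 16, 28, 28)
fused = torch.cat([v_feat, a_tiled], dim=1)                     # (B, 192, 16, 28, 28)
print(fused.shape)  # torch.Size([2, 192, 16, 28, 28])
```

The key point is the last two lines: the audio feature map has time but no spatial dimensions, so it is tiled (broadcast) across every (H, W) location of the video feature map, and the two are then concatenated channel-wise. This only works because the strided 1D convolutions have already reduced the audio to the same number of time steps as the pooled video stream.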