YuanGongND / ssast

Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".

Stereo audio

matthiasanderer opened this issue

Would this also work for stereo (i.e. 2 channel) audio?

I wonder how best to adapt the code for this, especially since the timm parts have already been trimmed down from 3 channels to 1 channel.

Hi,

I think it is doable, even with our pretrained model.

  1. These lines are where we select the first channel; you need to change them.

temp_wav[0, 0:waveform2.shape[1]] = waveform2

waveform2 = waveform2[0, 0:waveform1.shape[1]]
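A minimal sketch of how the two lines above could be generalized so that every channel is kept, assuming waveforms are laid out as `(channels, samples)` tensors. The helper name `match_length` is hypothetical and is not part of the repository:

```python
import torch


def match_length(waveform: torch.Tensor, target_len: int) -> torch.Tensor:
    """Zero-pad or crop a (channels, samples) waveform along the time axis.

    Hypothetical helper: unlike indexing with [0, ...], it keeps all channels.
    """
    channels, num_samples = waveform.shape
    if num_samples < target_len:
        # Pad with trailing zeros in every channel.
        padded = torch.zeros(channels, target_len, dtype=waveform.dtype)
        padded[:, :num_samples] = waveform
        return padded
    # Crop (or return unchanged if the lengths already match).
    return waveform[:, :target_len]
```

With this, `waveform2 = match_length(waveform2, waveform1.shape[1])` would cover both the padding and the cutting case for mono and stereo input alike.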

  2. You also need to work on the fbank extraction to make sure the output has two channels.

fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,

This introduces a new dimension, which was squeezed out for single-channel fbanks, so you also need to adjust the input pre-processing on the model side:

x = x.unsqueeze(1)

Note that we do this in multiple forward passes; the line above is just one of them.
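One way this step might look, as a sketch only: compute the features per channel, stack them, and add the channel dimension on the model side only when it is missing. Both helper names are hypothetical, and `feature_fn` is a placeholder for whatever single-channel feature extractor is used:

```python
import torch


def stack_channel_features(waveform: torch.Tensor, feature_fn) -> torch.Tensor:
    """Apply feature_fn to each channel of a (channels, samples) waveform.

    feature_fn is a placeholder (an assumption, not repository code): it takes
    a (1, samples) tensor and returns a (frames, bins) tensor.
    Returns a (channels, frames, bins) tensor.
    """
    feats = [feature_fn(waveform[c:c + 1]) for c in range(waveform.shape[0])]
    return torch.stack(feats, dim=0)


def ensure_channel_dim(x: torch.Tensor) -> torch.Tensor:
    """Add a channel axis to (batch, frames, bins); leave 4-D input unchanged."""
    if x.dim() == 3:
        return x.unsqueeze(1)
    if x.dim() == 4:
        return x
    raise ValueError(f"expected a 3-D or 4-D tensor, got {x.dim()}-D")
```

In the data loader, `feature_fn` could presumably wrap the existing `torchaudio.compliance.kaldi.fbank` call applied to one channel at a time, and `ensure_channel_dim(x)` would stand in for each `x = x.unsqueeze(1)` in the forward passes.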

  3. Then you need to change the model's patch embedding layer to take two input channels instead of one.

new_proj = torch.nn.Conv2d(1, self.original_embedding_dim, kernel_size=(fshape, tshape), stride=(fstride, tstride))
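If the pretrained single-channel weights should be reused, one common option (an assumption here, not something the repository does) is to copy the one-channel kernel to every input channel and divide by the channel count, so that a stereo input with identical channels produces the same patch embeddings as the mono model. The function name `expand_patch_embed` is hypothetical:

```python
import torch


def expand_patch_embed(old_proj: torch.nn.Conv2d, in_chans: int = 2) -> torch.nn.Conv2d:
    """Build a multi-channel patch projection from a single-channel one.

    Hypothetical helper. Assumes old_proj has one input channel and a bias.
    """
    new_proj = torch.nn.Conv2d(
        in_chans,
        old_proj.out_channels,
        kernel_size=old_proj.kernel_size,
        stride=old_proj.stride,
    )
    with torch.no_grad():
        # (out, 1, kh, kw) -> (out, in_chans, kh, kw), scaled so the sum over
        # input channels matches the original single-channel response.
        new_proj.weight.copy_(old_proj.weight.repeat(1, in_chans, 1, 1) / in_chans)
        new_proj.bias.copy_(old_proj.bias)
    return new_proj
```

The scaling only preserves the pretrained behavior at initialization; the two channel kernels are free to diverge during fine-tuning.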

In short, it needs some (careful) changes to the code, but it is doable. I am not sure about your purpose, but it will be easier if you can add the two channels together into a single channel.
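A minimal sketch of that simpler route, assuming a `(channels, samples)` waveform such as the one `torchaudio.load` returns by default. It averages the channels instead of summing them so the amplitude stays in the original range; that choice, and the name `downmix_to_mono`, are assumptions:

```python
import torch


def downmix_to_mono(waveform: torch.Tensor) -> torch.Tensor:
    """Average all channels of a (channels, samples) waveform into one.

    Returns a (1, samples) tensor, the shape a mono pipeline expects.
    """
    return waveform.mean(dim=0, keepdim=True)
```

Applied right after loading, this leaves the rest of the single-channel pipeline and the pretrained model untouched.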

-Yuan