YuanGongND / ssast

Would this also work for stereo (i.e. 2 channel) audio?

I wonder how best to adapt the code for this, especially since the timm parts have already been trimmed down from 3 channels to 1 channel.

Hi,

I think it is doable, even with our pretrained model.

These are the places where we select the first channel; you need to change them.

ssast/src/dataloader.py

Line 112 in a1a3eec

temp_wav[0, 0:waveform2.shape[1]] = waveform2

ssast/src/dataloader.py

Line 116 in a1a3eec

waveform2 = waveform2[0, 0:waveform1.shape[1]]
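As a rough illustration of what a channel-preserving version of these two lines might look like, here is a minimal sketch. It assumes waveforms are tensors shaped `(channels, samples)`; the helper name `match_length` is hypothetical and is not part of the repository.

```python
import torch


def match_length(waveform2: torch.Tensor, target_len: int) -> torch.Tensor:
    """Zero-pad or trim a (channels, samples) waveform to target_len samples.

    Hypothetical helper: it keeps every channel instead of indexing channel 0.
    """
    channels, num_samples = waveform2.shape
    if num_samples < target_len:
        # Pad with trailing zeros, copying all channels at once.
        padded = torch.zeros(channels, target_len, dtype=waveform2.dtype)
        padded[:, :num_samples] = waveform2
        return padded
    # Trim along the sample axis only, leaving the channel axis untouched.
    return waveform2[:, :target_len]
```

With this, `waveform2 = match_length(waveform2, waveform1.shape[1])` would cover both the padding and the trimming case for mono and stereo input alike.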

You also need to change the fbank extraction so that its output has two channels.

ssast/src/dataloader.py

Line 126 in a1a3eec

fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,

A two-channel fbank has an extra channel dimension, which is squeezed out for single-channel fbanks. So you also need to take care of the input pre-processing on the model side.

ssast/src/models/ast_models.py

Line 436 in a1a3eec

x = x.unsqueeze(1)

Note that we do this in several forward-pass functions; the line above is just one of them.
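Since the same step appears in several forward passes, one option is to factor it into a small helper that accepts both mono and stereo batches. This is only a sketch: the helper name `to_model_input` is hypothetical, and the final transpose to `(batch, channels, freq, time)` is an assumption about the layout the patch projection expects.

```python
import torch


def to_model_input(x: torch.Tensor) -> torch.Tensor:
    """Bring a batch of fbanks to (batch, channels, freq, time).

    Hypothetical helper; the layout is assumed, not taken from the repository.
    """
    if x.dim() == 3:
        # Mono batch (batch, time, freq): add the missing channel axis.
        x = x.unsqueeze(1)
    elif x.dim() != 4:
        # Stereo batches should already be (batch, channels, time, freq).
        raise ValueError(f"expected a 3-D or 4-D input, got {x.dim()}-D")
    # Swap the time and frequency axes.
    return x.transpose(2, 3)
```

Each forward pass that currently calls `x.unsqueeze(1)` would then call the helper instead.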

Then you need to change the patch projection layer so that it takes two input channels instead of one.

ssast/src/models/ast_models.py

Line 130 in a1a3eec

new_proj = torch.nn.Conv2d(1, self.original_embedding_dim, kernel_size=(fshape, tshape), stride=(fstride, tstride))
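To keep using pretrained weights after widening this layer, one common trick (not something the repository provides) is to copy the single-channel kernel into every input channel and divide by the channel count. The sketch below assumes the existing layer has exactly one input channel; `inflate_patch_proj` is a hypothetical name.

```python
import torch


def inflate_patch_proj(old_proj: torch.nn.Conv2d, in_channels: int = 2) -> torch.nn.Conv2d:
    """Build a multi-channel Conv2d initialized from a 1-channel one.

    Hypothetical helper: assumes old_proj.in_channels == 1.
    """
    has_bias = old_proj.bias is not None
    new_proj = torch.nn.Conv2d(
        in_channels,
        old_proj.out_channels,
        kernel_size=old_proj.kernel_size,
        stride=old_proj.stride,
        bias=has_bias,
    )
    with torch.no_grad():
        # Repeat the kernel along the input-channel axis and rescale, so that
        # identical left/right channels reproduce the original mono output.
        new_proj.weight.copy_(old_proj.weight.repeat(1, in_channels, 1, 1) / in_channels)
        if has_bias:
            new_proj.bias.copy_(old_proj.bias)
    return new_proj
```

With this initialization the model starts from the pretrained behavior and can learn channel-specific filters during fine-tuning.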

In short, it needs some careful code changes, but it is doable. I am not sure what your goal is, but it would be easier if you could add the two channels together into a single channel.
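For the simpler route, a minimal sketch of the mixdown could look like this. It assumes a `(channels, samples)` waveform and averages the channels rather than summing them, which is the same mix scaled by the channel count and keeps the amplitude in the original range. The name `downmix_to_mono` is hypothetical.

```python
import torch


def downmix_to_mono(waveform: torch.Tensor) -> torch.Tensor:
    """Average all channels of a (channels, samples) waveform into (1, samples)."""
    return waveform.mean(dim=0, keepdim=True)
```

The result has the same shape as a mono recording, so the rest of the pipeline would not need to change.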

-Yuan

Stereo audio