YuanGongND / ssast

Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".


Dataset mean / stdev normalization

MikeKras opened this issue · comments

Hi Yuan!

Thanks again for this next iteration of the model - it was the improvement that I was hoping for in our task!

I have a quick question regarding normalization. You mention in your AST paper:

We also normalize the input audio spectrogram so that the dataset mean and standard deviation are 0 and 0.5, respectively.

The same scheme is also used here, and I wonder: since this is done at the dataloader level and all of the finetuning follows it, how much of an impact does it have on the final performance? Is it mostly for the sake of speeding up convergence in training?
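For concreteness, the scheme I mean looks roughly like this at the dataloader level (a sketch with placeholder stats, not the recipe's actual values):

import torch

# Placeholder dataset-level fbank stats; the recipes compute these per dataset.
norm_mean, norm_std = -6.6, 5.4

def normalize(fbank: torch.Tensor) -> torch.Tensor:
    # Dividing by 2 * std maps the dataset to mean 0 and std 0.5,
    # the scheme described in the AST paper.
    return (fbank - norm_mean) / (norm_std * 2)

fbank = torch.randn(1024, 128) * 5.4 - 6.6   # toy fbank tensor
print(normalize(fbank).mean(), normalize(fbank).std())  # ~0.0, ~0.5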

I am creating some versions of the model that may be used in systems with periodic retraining, and I am trying to decide whether I should bother recalculating the statistics every time we train (or even at all). My finetuning that used the values from ESC-50 nevertheless performed really well.

Cheers,
Michał

commented

YuanGongND/ast#51 (comment)
Stumbled upon this, it seems we ran into the same issue.

Hi there,

We haven't tried skipping normalization for SSAST. The reason we do normalization for AST is that we want to use an ImageNet-pretrained model, which was trained with normalized input. For SSAST it might not be necessary, but for a fair comparison with AST, we used the same normalization.

Nevertheless, with extensive experiments, I found that the model performance is not sensitive to the normalization stats (mean/std); e.g., the performance changes only marginally even when I apply the AudioSet stats to ESC-50 experiments. So I guess it is OK to use the same stats across your experiments, or even to just reuse the normalization stats from our recipe.
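For example (a sketch; the stats values below are the ones I believe the AST recipe scripts use, but please verify them against the repo before relying on them):

# Reusing existing recipe stats instead of recomputing them per dataset.
DATASET_STATS = {
    'audioset': (-4.2677393, 4.5689974),  # assumed from the AST recipe scripts
    'esc50': (-6.6268077, 5.358466),      # assumed from the AST recipe scripts
}

norm_mean, norm_std = DATASET_STATS['esc50']
# then, in the dataloader:
# fbank = (fbank - norm_mean) / (norm_std * 2)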

-Yuan

commented

Hi Yuan,
Thank you for following up! This is interesting, since your experiments seem to contradict my experimental results...
If you have time, please take a look at my latest report: https://arxiv.org/abs/2203.13448 where I have acknowledged most of your work. We could discuss further or even arrange a Zoom call if you think it's relevant.
Again, I appreciate your open-source effort and constant maintenance!

Thanks! I have read the paper but might need to read it more carefully.

Which specific finding conflicts with your experiments? Do you mean that the model performance is sensitive to the normalization stats?

commented

Yes, I found that changing the normalization caused huge changes in model performance.

I see. In the last paragraph of the paper:

In our experiment, Training AST using input of var(x) = 1 or var(x) = 0.0625 would lead to mAP of 0.09, 0.02 respectively.

I am not surprised, because:

1. AST uses ImageNet pretraining, so a distribution shift naturally leads to some drop. SSAST might be different, but based on your results I feel normalization is still needed.

2. Changing var(x) to 1 or 0.0625 is a big change. By "model performance is not sensitive to the normalization stats" I don't mean changes that large; in fact, even reusing the AudioSet norm stats for ESC-50 is a much smaller change than this.

3. I used fbank = (fbank - self.norm_mean) / (self.norm_std) for my most recent work, and the performance is almost unchanged. Is that what you mean by var(x) = 1? If so, our conclusions are indeed different.
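To make the scaling explicit (a toy sketch, not recipe code): dividing a zero-mean input by k * std gives variance 1 / k^2:

import torch

x = torch.randn(100000) * 5.0 + 3.0      # toy "fbank" values: mean ~3, std ~5
mean, std = x.mean(), x.std()

print(((x - mean) / std).var())          # ~1.0    -> var(x) = 1
print(((x - mean) / (std * 2)).var())    # ~0.25   -> the AST/SSAST recipe
print(((x - mean) / (std * 4)).var())    # ~0.0625 -> the other setting quoted above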

Btw, did you normalize with the same stats in the training/evaluation stages? A mismatch would cause an obvious performance drop.

commented

Yes, I made sure the normalization stats are consistent between training and evaluation.

commented

There could be another explanation: my 64x400 feature has a mean above 20 and a std above 40, which might cause a huge scaling issue.

I see, good to learn.

Btw, it is a very nice paper.

commented

Thank you! I plan to try a few more things in the coming days; this paper is my first recent attempt at AudioSet. I really appreciate your open spirit of discussion, and I will keep you posted.

fbank = (fbank - self.norm_mean) / (self.norm_std) (i.e., without the * 2 on std) certainly works well for me. It would be interesting if anyone else could report their experience with normalization.

Hi!
Thanks for your work.

Now I want to finetune your model on another audio dataset, so I need to recalculate the mean and variance.
Should the mean and variance be calculated using only the training data? And when processing the validation data, should the mean and variance computed from the training data still be used?

Thanks for your reply!

Should the mean and variance be calculated using only the training data?

I think stats computed from the training data alone, or from the training and validation data together, are both fine. If I recall correctly, we used the norm stats of the training data with augmentation applied. In general, my experience is that the performance is not sensitive to the exact norm stats, but doing normalization at all is very important.

And when processing the validation data, should the mean and variance computed from the training data still be used?

However they are calculated, I think it is important to keep the norm stats consistent between training and evaluation.
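A minimal sketch of what that looks like (a hypothetical FbankDataset wrapper, with toy tensors standing in for real fbanks):

import torch
from torch.utils.data import Dataset

class FbankDataset(Dataset):
    """Hypothetical wrapper: the point is that the norm stats are injected
    once, not recomputed per split."""
    def __init__(self, fbanks, norm_mean, norm_std):
        self.fbanks = fbanks
        self.norm_mean = norm_mean
        self.norm_std = norm_std

    def __len__(self):
        return len(self.fbanks)

    def __getitem__(self, i):
        # Same normalization in every split, using the training-set stats.
        return (self.fbanks[i] - self.norm_mean) / (self.norm_std * 2)

# Toy stand-ins for real precomputed fbanks.
train_fb = [torch.randn(100, 128) for _ in range(8)]
val_fb = [torch.randn(100, 128) for _ in range(2)]

# Compute the stats from the training data only...
all_train = torch.cat([f.flatten() for f in train_fb])
mean, std = all_train.mean().item(), all_train.std().item()

# ...and pass the SAME values to both splits.
train_set = FbankDataset(train_fb, mean, std)
val_set = FbankDataset(val_fb, mean, std)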

Thanks for your clear reply!

I have a couple more questions about normalization.

  1. In dataloader.py:

    waveform = waveform - waveform.mean()

    fbank = (fbank - self.norm_mean) / (self.norm_std * 2)

    You normalize the waveform first, and then normalize the fbank feature. Is this redundant?
    And is normalizing the waveform an important trick for improving performance?

  2. In addition, I don't understand the normalization of the fbank feature.

    fbank = (fbank - self.norm_mean) / (self.norm_std * 2)

    Why does the denominator use * 2 rather than ** 2, i.e., why not:

    fbank = (fbank - self.norm_mean) / (self.norm_std ** 2)

Looking forward to hearing from you again!

You normalize the waveform first, and then normalize the fbank feature. Is this redundant?

The first removes the DC offset; the second performs input normalization for better network training. They are not redundant, and I guess you need both, but you can try removing the first.
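A sketch of the two steps together (the file path and stats here are illustrative; the recipes use dataset-level stats, not per-clip ones):

import torchaudio

waveform, sr = torchaudio.load('example.wav')   # placeholder path

# Step 1: remove the DC offset so the waveform is centered at zero.
waveform = waveform - waveform.mean()

fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

# Step 2: normalize the fbank for stable network training.
# (per-clip stats used here only to keep the sketch self-contained)
norm_mean, norm_std = fbank.mean(), fbank.std()
fbank = (fbank - norm_mean) / (norm_std * 2)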

Why does the denominator use * 2 rather than ** 2?

It shouldn't be ** 2 in any case; we just normalize the input to a smaller variance, which seems to give a very minor performance improvement for the original AST (with ImageNet pretraining). But I think it is safe to just do

fbank = (fbank - self.norm_mean) / self.norm_std

Note that all our pretrained models are trained with self.norm_std * 2, so if you want to use our pretrained models, please keep the * 2 consistent with us. Otherwise, you can just use the std.
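In other words (a toy sketch; use your real features and dataset-level stats):

import torch

fbank = torch.randn(1024, 128) * 4.0 - 5.0       # toy fbank; use your real features
norm_mean, norm_std = fbank.mean(), fbank.std()  # placeholder, per-clip stats

# With the released pretrained checkpoints, keep the training-time scaling:
fbank_pretrained = (fbank - norm_mean) / (norm_std * 2)   # dataset std becomes 0.5

# When training from scratch, plain standardization is also fine:
fbank_scratch = (fbank - norm_mean) / norm_std            # dataset std becomes 1.0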

commented

Hi, thank you very much for the open-source code.

I also have a question regarding the mean/variance.

From the discussion above, I understand I should use the mean/variance values of my own dataset.
Since I apply no data augmentation, I computed the mean/variance after the following processing:

wav = (wav - wav.mean()).unsqueeze(0)  # unsqueeze to (1, n); kaldi.fbank expects (channel, time)
fb = torchaudio.compliance.kaldi.fbank(
            wav, 
            htk_compat=True, 
            sample_frequency=16000, 
            use_energy=False, 
            window_type='hanning', 
            num_mel_bins=128, 
            dither=0.0, 
            frame_shift=10
            )

The obtained mean (~ -6.02) is a normal value compared with the stats of other datasets, but the variance is much larger, at 16.83 (the dataset is relatively small compared with the other datasets).

I am wondering if I did something wrong in computing the stats?

Can I ask what type of sound your dataset contains? It is possible to have a larger variance. Also, do you plan to pretrain the model with your own dataset, or to use our pretrained checkpoint? For the latter, you might need to be more careful with input normalization.

-Yuan

commented

Thanks a lot for the reply!

I am using the DCASE challenge task 4 dataset.

Here is the code for computing the normalization stats:

from glob import glob

import torchaudio
from tqdm import tqdm

running_stats = []
filenames = glob(path_1 + "*.wav") + glob(path_2 + "*.wav")
element = 0
for file in tqdm(filenames):
    wav, _, _, _ = read_audio(file, random_channel=False, multisrc=False, pad_to=None)
    wav = (wav - wav.mean()).unsqueeze(0)
    melspec = torchaudio.compliance.kaldi.fbank(
        wav,
        htk_compat=True,
        sample_frequency=audio_configs["sr"],
        use_energy=False,
        window_type='hanning',
        num_mel_bins=audio_configs["n_mels"],
        dither=0.0,
        frame_shift=10
        )
    running_stats.append(melspec)
    element += melspec.numel()

# calculate the dataset-level mean
running_mean = 0
for emd in running_stats:
    running_mean += emd.sum().item() / element

# accumulate the dataset-level variance
running_var = 0
for emd in running_stats:
    running_var += ((emd - running_mean) ** 2).sum().item() / element
# note: running_var is the variance; the std is its square root
running_std = running_var ** 0.5
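If keeping every spectrogram in memory becomes a problem, the same stats can be accumulated in one pass (a sketch; the extraction is inlined using torchaudio.load instead of read_audio, and the wav path is a placeholder):

from glob import glob

import torchaudio

def compute_melspec(path):
    wav, sr = torchaudio.load(path)
    wav = wav - wav.mean()
    return torchaudio.compliance.kaldi.fbank(
        wav, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

total, total_sq, n = 0.0, 0.0, 0
for file in glob('path_to_wavs/*.wav'):          # placeholder path
    melspec = compute_melspec(file)
    total += melspec.sum().item()
    total_sq += (melspec ** 2).sum().item()
    n += melspec.numel()

mean = total / n
std = (total_sq / n - mean ** 2) ** 0.5          # std = sqrt(E[x^2] - E[x]^2)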

What I am doing is extracting frame-level features from the pretrained frame-wise AST model, so I think normalization is a fairly important factor for my experiments.

Also, to extract the features, I use the SSAST-Base-Frame-400 checkpoint from the link in this repo and obtain the features via finetuningavgtok() in the model file.

Here is the code:

    def finetuningavgtok(self, x):
        B = x.shape[0]
        # split the spectrogram into patch/frame embeddings
        x = self.v.patch_embed(x)
        # prepend the cls token (and the distillation token, if present)
        if self.cls_token_num == 2:
            cls_tokens = self.v.cls_token.expand(B, -1, -1)
            dist_token = self.v.dist_token.expand(B, -1, -1)
            x = torch.cat((cls_tokens, dist_token, x), dim=1)
        else:
            cls_tokens = self.v.cls_token.expand(B, -1, -1)
            x = torch.cat((cls_tokens, x), dim=1)
        # add positional embeddings and run the transformer blocks
        x = x + self.v.pos_embed
        x = self.v.pos_drop(x)
        for blk_id, blk in enumerate(self.v.blocks):
            x = blk(x)
        x = self.v.norm(x)
        # average output of all tokens except cls token(s)
        # (commented out here so the full token sequence is returned)
        # x = torch.mean(x[:, self.cls_token_num:, :], dim=1)
        # x = self.mlp_head(x)
        return x

I noticed that the pretrained model uses a time stride of 2, while the finetuned models for downstream tasks are trained with a stride of 1. If I want to use the features directly (freezing the AST), I think I should initialize the AST model with tstride=2, right?

Namely,

ASTModel(label_dim=xxx, fshape=128, tshape=2, fstride=128, tstride=2, input_fdim=128, input_tdim=xxx, model_size="base", pretrain_stage=False, load_pretrained_mdl_path=pretrained_path)

Hi there,

Thanks for the clarification.

First, regarding this point:

What I am doing is extracting frame-level features from the pretrained frame-wise AST model, so I think normalization is a fairly important factor for my experiments.

I assume you want to use the (frozen) embeddings for some kind of downstream task, potentially for a challenge (i.e., the numbers matter). In my experience, using SSL-pretrained embeddings in a frozen setting is not optimal for this purpose; using finetuned embeddings would be much better. This is probably true for almost all SSL models, because during pretraining the model has not seen any labels.

I guess one reason is that you want a temporally ordered representation, while AST and other patch-based models do not natively support this. But you can mean-pool the frequency dimension to put the representation in temporal order; e.g., if the output of AST is (64, 8) in (time, freq), you can mean-pool the second dimension to get shape (64). In my opinion, this is a better way to use SSAST without finetuning.
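A sketch of that pooling (toy tensors; the 64 x 8 grid and two cls tokens are illustrative, and you should verify the patch scan order of your model):

import torch

B, D = 4, 768                                # batch size, embedding dim
cls_token_num, t_dim, f_dim = 2, 64, 8       # illustrative patch-grid geometry
x = torch.randn(B, cls_token_num + t_dim * f_dim, D)   # stand-in for the model output

patches = x[:, cls_token_num:, :]            # drop the cls/dist tokens
patches = patches.reshape(B, t_dim, f_dim, D)  # assumes time-major patch order
frame_feats = patches.mean(dim=2)            # mean-pool frequency -> (B, 64, D)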

The typical usage is that you have some special data, much of it unlabeled; you can pretrain SSAST on the unlabeled data and then finetune.

-Yuan