YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".


What is the objective when pretraining?

Young973 opened this issue · comments

TBH, I'm a little confused about what the objective is when pretraining AST. It doesn't seem to be indicated in the paper. BTW, when pretraining SSAST, the discriminative objective is classification with InfoNCE and the generative objective is reconstruction. But what is it in AST?

Hi there,

It is just ImageNet pretraining, i.e., using an ImageNet-pretrained DeiT as the initial weights for AST:

```python
if model_size == 'tiny224':
    self.v = timm.create_model('vit_deit_tiny_distilled_patch16_224', pretrained=imagenet_pretrain)
elif model_size == 'small224':
    self.v = timm.create_model('vit_deit_small_distilled_patch16_224', pretrained=imagenet_pretrain)
elif model_size == 'base224':
    self.v = timm.create_model('vit_deit_base_distilled_patch16_224', pretrained=imagenet_pretrain)
elif model_size == 'base384':
    self.v = timm.create_model('vit_deit_base_distilled_patch16_384', pretrained=imagenet_pretrain)
else:
    raise Exception('Model size must be one of tiny224, small224, base224, base384.')
```

-Yuan

If you mean audio-domain pretraining, that is just training AST on AudioSet (starting from the ImageNet initialization) with BCE loss for the multi-label classification task. You can then take the model to other audio tasks (e.g., ESC-50).
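To illustrate the objective: AudioSet is multi-label (a clip can contain several of its 527 sound classes), so each class is treated as an independent binary prediction and the loss is binary cross-entropy over the logits. A minimal sketch of that loss computation, with placeholder tensors standing in for AST's outputs and the multi-hot labels (not the repo's actual training code):

```python
import torch
import torch.nn as nn

num_classes = 527   # AudioSet has 527 sound classes
batch_size = 4

# Placeholder model outputs (pre-sigmoid logits) and multi-hot labels;
# in the real pipeline these come from AST and the AudioSet annotations.
logits = torch.randn(batch_size, num_classes)
targets = torch.randint(0, 2, (batch_size, num_classes)).float()

# BCEWithLogitsLoss applies a sigmoid per class, then binary cross-entropy,
# so each of the 527 classes is an independent yes/no decision.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
```

Compare this with a single-label setup, which would instead use a softmax with cross-entropy; the BCE formulation is what makes multiple simultaneous sound events per clip possible.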