YuanGongND / ssast

Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".

Embeddings without fine tuning

HugoBothaMD opened this issue · comments

Dear @YuanGongND ,

Thanks for sharing the code for your work - which is awesome btw - and for the excellent documentation.

I modified your wrapper scripts as you suggested and ran pretraining on my own dataset of roughly 900 speakers and 1,500 samples.

However, I'm not really interested in proceeding with a fine-tuning stage; instead, I would like to extract the (mean-pooled?) final embedding for each sample. I.e., instead of predicting a label I just want the embedding.

Is this a functionality that is relatively simple to hack onto your existing code?

Hugo

I am wondering if modifying these two lines and using finetuningavgtok as the task in forward could help your application (basically comment out the mlp head)?

x = torch.mean(x[:, self.cls_token_num:, :], dim=1)
x = self.mlp_head(x)
return x
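A minimal sketch of that change, with a toy module standing in for the model (the class and default dimensions here are hypothetical; only cls_token_num, mlp_head, and the two pooling/head lines mirror the snippet above):

```python
import torch
import torch.nn as nn

class ToyAvgTokHead(nn.Module):
    """Toy stand-in for the finetuningavgtok output stage (hypothetical names/sizes)."""
    def __init__(self, embed_dim=768, cls_token_num=2, label_dim=35):
        super().__init__()
        self.cls_token_num = cls_token_num
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim), nn.Linear(embed_dim, label_dim))

    def forward(self, x, return_embedding=True):
        # x: (batch, cls_token_num + n_patch_tokens, embed_dim) encoder output
        x = torch.mean(x[:, self.cls_token_num:, :], dim=1)  # mean-pool patch tokens
        if return_embedding:            # "comment out the mlp head"
            return x                    # (batch, embed_dim) embedding
        return self.mlp_head(x)         # (batch, label_dim) logits

tokens = torch.randn(4, 2 + 512, 768)
print(ToyAvgTokHead()(tokens).shape)  # torch.Size([4, 768])
```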

Btw, I would suggest first trying the original inference code to confirm the prediction accuracy is as expected, and then extracting the embedding. This helps avoid mistakes in the inference pipeline (e.g., input normalization, model parallelization, etc.).
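One input-normalization sanity check could look like the following; the statistics and the scaling convention here are placeholders, so substitute the values from your own training config:

```python
import torch

# Placeholder dataset statistics (assumptions, not values from this repo).
NORM_MEAN, NORM_STD = -4.27, 4.57

# Apply the same normalization at embedding-extraction time as at training time.
fbank = torch.randn(1, 1024, 128)              # (batch, time, n_mels) log-mel input
fbank = (fbank - NORM_MEAN) / (NORM_STD * 2)   # scaling convention is an assumption
print(fbank.shape)  # torch.Size([1, 1024, 128])
```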

-Yuan

Perfect, thanks so much.

I ended up defining 'finetuningavgtok_embed' as another task option in ASTModel: basically 'finetuningavgtok' with the last line commented out. Not sure whether that will work better than the version with the last two lines commented out, as you suggested, but either way there are some options for embeddings to test!

Hugo

Hi @HugoBothaMD , I am also interested in using the encoder part of this model for an audio event classification task. It would be a great help if you could guide me on how I could make use of the model/code in this repo for my application. I would like to have the setup below.

I would like to generate the embeddings for my audio events of 1 second duration:

def get_embeddings_ssast(audio_event):
    # define the SSAST model
    # load pretrained weights
    # create an alternate model which returns the embeddings
    # generate the embeddings
    # return the embeddings

I would like to use these embeddings as input features in my classifier.
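A hypothetical fleshed-out version of that skeleton might look as follows; DummySSASTEncoder is a stand-in (in practice you would construct the ASTModel from this repo and load the pretrained checkpoint, as discussed above):

```python
import torch
import torch.nn as nn

class DummySSASTEncoder(nn.Module):
    """Stand-in for the pretrained SSAST encoder (hypothetical)."""
    def __init__(self, n_mels=128, embed_dim=768):
        super().__init__()
        # stand-in for patch embedding + transformer blocks
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, fbank):
        tokens = self.proj(fbank)         # (batch, time, embed_dim)
        return torch.mean(tokens, dim=1)  # mean-pool over time

def get_embeddings_ssast(fbank, model):
    # fbank: (batch, time, n_mels) log-mel features for one audio event
    model.eval()
    with torch.no_grad():
        return model(fbank)  # (batch, embed_dim) feature vector

model = DummySSASTEncoder()
fbank = torch.randn(1, 100, 128)  # ~1 s of audio at a 10 ms hop (assumption)
print(get_embeddings_ssast(fbank, model).shape)  # torch.Size([1, 768])
```

The resulting (batch, embed_dim) vectors can then be fed directly to a downstream classifier.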

Hi, did you finish it? Thank you

@sreenivasaupadhyaya @Mortyzhou-Shef-BIT
Hi there,
I agree SSAST is a cool modeling approach. However, it is not my repo/project and I do not have the bandwidth to help you with your use case. Commenting out the lines above as suggested by @YuanGongND worked for my use case. It's probably best to close this thread as resolved.
Hugo