Use AudioVec Operator for Clustering Audio
aatmanvaidya opened this issue · comments
- Setup Feluda
- Convert video files to audio
- Try out Feluda AudioVec (or something you like) and use t-SNE (or other approaches) to evaluate clustering visually
- Can do this in a Jupyter notebook and show the results
- Look at other pre-trained models to convert audio files to embeddings.
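For the video-to-audio step, one common route is shelling out to ffmpeg. A minimal sketch, assuming ffmpeg is installed and the filenames are placeholders (the helper name is hypothetical):

```python
import subprocess
from pathlib import Path

def video_to_wav_cmd(video_path: str, out_dir: str = "audio", sr: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts mono 16 kHz WAV audio from a video file."""
    out_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    return [
        "ffmpeg", "-y",    # overwrite existing output
        "-i", video_path,  # input video
        "-vn",             # drop the video stream
        "-ac", "1",        # downmix to mono
        "-ar", str(sr),    # resample (16 kHz is a common input rate for audio models)
        str(out_path),
    ]

# subprocess.run(video_to_wav_cmd("clip.mp4"), check=True)  # uncomment to actually convert
```

16 kHz mono is a sketch default here; check the sample rate each embedding model expects before batch-converting.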
assigning this issue to @Chaithanya512
A possible dataset to use - https://www.kaggle.com/datasets/mmoreaux/environmental-sound-classification-50
@Chaithanya512 can you comment on the issue so we can assign it to you?
Hi, thank you for the information.
- Setup Feluda and run tests for AudioVecEmbedding Operator
- Collect a dataset of 150-200 Audio Files
- Run the Feluda AudioVec Operator on the dataset, reduce dimensions using t-SNE, and create a visual plot - this will act as our baseline
- Try out different Embedding Models
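The baseline in the plan above (embeddings → t-SNE → 2-D plot) can be sketched like this; the random vectors are stand-ins for real AudioVec embeddings, not actual results:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for real AudioVec embeddings: 200 clips, 512-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))
labels = rng.integers(0, 10, size=200)  # placeholder class labels for coloring

# Reduce to 2-D for visual inspection; perplexity must be smaller than n_samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (200, 2)
```

From there, a scatter of `coords` colored by `labels` (matplotlib or plotly) gives the visual clustering check.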
Hi,
Status:
- Set up Feluda and ran tests for the AudioVec Operator.
- Curated a dataset by selecting 200 audio files from the environmental dataset. The original dataset contains 50 classes with 40 audio files each; I selected 10 random classes and gathered 20 files from each (10 × 20 = 200), plus 114 audio files generated from YouTube videos, for a total of 314 audio files.
- Tested the AudioVec Operator on the dataset, reduced dimensions using t-SNE, and visualized the result with Plotly. Also tested the full original dataset (all 2000 files) with the AudioVec Operator.
- Tried out OpenL3 embedding model.
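For reproducibility, the class/file sampling described above (10 random classes × 20 files each) can be sketched like this; the folder layout and helper name are illustrative, not the actual curation script:

```python
import random

def sample_subset(files_by_class, n_classes=10, n_per_class=20, seed=42):
    """Pick n_classes random classes, then n_per_class random files from each."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(files_by_class), n_classes)
    return {c: rng.sample(sorted(files_by_class[c]), n_per_class) for c in classes}

# Toy stand-in for the ESC-50 layout: 50 classes x 40 files each.
files_by_class = {
    f"class_{i}": [f"class_{i}/clip_{j}.wav" for j in range(40)] for i in range(50)
}
subset = sample_subset(files_by_class)
print(sum(len(v) for v in subset.values()))  # 200
```

Fixing the seed makes the subset reproducible across runs, which matters when comparing embedding models on the same files.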
- Play the audio from the plot in Colab
- Try out more embedding models - AST
- Do a literature review for video-based transformer models trained on audio
google experiment clustering code - https://github.com/googlecreativelab/aiexperiments-drum-machine
@Snehil-Shah thanks for sharing, looks interesting. @Chaithanya512 do try implementing it - I found a direct PyTorch implementation at https://github.com/harritaylor/torchvggish/tree/f70241ba, and you can download the model from PyTorch Hub (this should hopefully simplify things for you).
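One practical detail with VGGish-style models: they consume fixed-length examples (0.96 s of 16 kHz mono audio per example), so clips need to be framed before embedding. A numpy-only sketch of that framing step (the frame lengths follow the VGGish convention; the helper name is hypothetical, and the actual model would be loaded separately via `torch.hub`):

```python
import numpy as np

def frame_audio(waveform: np.ndarray, sr: int = 16000, frame_sec: float = 0.96) -> np.ndarray:
    """Split a mono waveform into non-overlapping 0.96 s frames (VGGish's
    example length), dropping any trailing partial frame."""
    frame_len = int(sr * frame_sec)  # 15360 samples at 16 kHz
    n_frames = len(waveform) // frame_len
    return waveform[: n_frames * frame_len].reshape(n_frames, frame_len)

# 5 seconds of silence -> 5 full 0.96 s frames (the 0.2 s remainder is dropped)
frames = frame_audio(np.zeros(5 * 16000))
print(frames.shape)  # (5, 15360)
```

Since one clip yields several frame-level embeddings, you would typically mean-pool them into a single clip-level vector before clustering.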
I am also sharing some other audio embedding models I found
Note - this is just a dump of things I am finding; no pressure to try everything, you can evaluate and take a call on what to try or skip.
- CLAP: Learning Audio Concepts From Natural Language Supervision - official code, Hugging Face, Microsoft python library.
- AST: Audio Spectrogram Transformer - Hugging Face
- AI4Bharat IndicWav2Vec - this model is specific to speech, though (but do read up more on it) - official code
- HuBERT - Hugging Face
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - Hugging Face, sample code
- L3-Net Deep Audio Embeddings - code - this looks built on / similar to the VGGish model that @Snehil-Shah shared.
- COLA: Contrastive learning of general purpose audio representations - blog, code, pytorch code
- SurfPerch - more for environmental sounds, from a cursory reading.
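Before running full t-SNE plots for each of the models above, a cheap sanity check of an embedding space is comparing mean cosine similarity within a class vs. across classes. A sketch with toy vectors (the helper name and synthetic data are illustrative only):

```python
import numpy as np

def same_vs_cross_class_similarity(embeddings: np.ndarray, labels: np.ndarray):
    """Mean cosine similarity for same-class pairs vs. cross-class pairs -
    a rough proxy for how separable an embedding space is."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    return sims[same & off_diag].mean(), sims[~same].mean()

# Toy data: two well-separated classes should score higher within-class.
rng = np.random.default_rng(0)
emb = np.concatenate([rng.normal(0, 1, (20, 64)), rng.normal(5, 1, (20, 64))])
labels = np.array([0] * 20 + [1] * 20)
within, across = same_vs_cross_class_similarity(emb, labels)
print(within > across)  # True
```

A model with a larger within/across gap on the labeled ESC-50 subset is a reasonable candidate to prioritize for the full clustering pipeline.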
Let me know if I can help out in any other way!
sure, thank you for sharing the embedding models. I will look into these.
- Let's keep trying out more transformer-based / CNN-ensemble models
- Evaluate the clustering results
- Do a literature review of other clustering algorithms
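As a starting point for the clustering-algorithm review above, scikit-learn makes it easy to compare a few standard algorithms side by side on the same embeddings. A sketch with synthetic blobs standing in for real embeddings (parameter values here are placeholders to tune):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in embeddings: 3 well-separated blobs in 32-D, 50 points each.
emb = np.concatenate([rng.normal(c, 0.5, (50, 32)) for c in (0, 3, 6)])

for name, algo in [
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("agglomerative", AgglomerativeClustering(n_clusters=3)),
    ("dbscan", DBSCAN(eps=4.0, min_samples=5)),
]:
    labels = algo.fit_predict(emb)
    if len(set(labels)) > 1:  # silhouette needs at least 2 clusters
        print(name, round(silhouette_score(emb, labels), 3))
```

DBSCAN is worth including because, unlike KMeans, it does not require fixing the number of clusters up front - useful when the true number of audio categories is unknown.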
Updated Notebook: https://colab.research.google.com/drive/1lBrWCyUsuCSTOEUUqDwfc6FzpQWO0ETt?usp=sharing
- Implemented KMeans Clustering on AudioVec Operator embeddings.
- Evaluated Audio Spectrogram Transformers (AST), VGGish, HuBERT.
Comments: AST performs better as of now; the AudioVec Operator with KMeans also shows good results.
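Since the curated subset has known ESC-50 class labels, the "performs better" comparison can be made quantitative with adjusted Rand index (ARI) between the KMeans assignments and the true classes. A sketch on synthetic stand-in embeddings (not actual results from the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Stand-in: 10 classes x 20 clips, well-separated 64-D embeddings.
true_labels = np.repeat(np.arange(10), 20)
emb = np.stack([rng.normal(5 * c, 1.0, 64) for c in true_labels])

pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)
ari = adjusted_rand_score(true_labels, pred)
print(round(ari, 3))  # close to 1.0 for well-separated classes
```

ARI is invariant to cluster relabeling (cluster 3 matching class 7 is fine) and scores 0 for random assignments, so it gives a single comparable number per embedding model.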