tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/


Use AudioVec Operator for Clustering Audio

aatmanvaidya opened this issue

  • Setup Feluda
  • Convert video files to audio (a conversion sketch follows this list)
  • Try out Feluda AudioVec (or something you like) and use t-SNE (or other approaches) to evaluate clustering visually
    • can do this in a jupyter notebook and show results
  • Look at other pre-trained models to convert audio files to embeddings.
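
For the video-to-audio step, a minimal sketch assuming ffmpeg is installed and on PATH; the folder names are placeholders:

```python
# Batch-convert videos to 16 kHz mono WAV with ffmpeg.
import subprocess
from pathlib import Path

VIDEO_DIR = Path("videos")  # hypothetical input folder
AUDIO_DIR = Path("audio")   # hypothetical output folder
AUDIO_DIR.mkdir(exist_ok=True)

for video in VIDEO_DIR.glob("*.mp4"):
    out = AUDIO_DIR / (video.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video),
         "-vn",           # drop the video stream
         "-ac", "1",      # downmix to mono
         "-ar", "16000",  # resample to 16 kHz
         str(out)],
        check=True,
    )
```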

assigning this issue to @Chaithanya512

@Chaithanya512 can you comment on the issue so we can assign it to you?

Hi, thank you for the information.

  • Setup Feluda and run tests for AudioVecEmbedding Operator
  • Collect a dataset of 150-200 Audio Files
  • Run the Feluda AudioVec Operator on the dataset, reduce dimensions using t-SNE, and make a visual plot (see the sketch after this list) - this will act as a baseline for us
  • Try out different Embedding Models
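
For the baseline plot, something like this sketch should work, assuming the operator's embeddings are collected into an (N, D) array; the file names are placeholders:

```python
import numpy as np
import plotly.express as px
from sklearn.manifold import TSNE

embeddings = np.load("audiovec_embeddings.npy")       # hypothetical dump of operator output
labels = np.load("labels.npy", allow_pickle=True)     # class name per file

# Project the high-dimensional embeddings down to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels,
                 title="AudioVec embeddings (t-SNE)")
fig.show()
```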

Hi,

Status:

  • Set up Feluda and ran tests for the AudioVec Operator.
  • Curated a dataset by selecting 200 audio files from the environmental dataset. The original dataset contains 50 classes with 40 audio files each; I selected 10 random classes and gathered 20 files from each (10×20 = 200), plus 114 audio files generated from YouTube videos, for a total of 314 audio files.
  • Tested the AudioVec Operator on the dataset, reduced dimensions using t-SNE, and visualized the result using Plotly. Also tested the original dataset (all 2000 files) with the AudioVec Operator.
  • Tried out the OpenL3 embedding model (see the sketch after the links below).

Google Colab Link
Dataset
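
For reference, the OpenL3 step can look roughly like this, using the openl3 package's get_audio_embedding API; the file path is a placeholder:

```python
import soundfile as sf
import openl3

audio, sr = sf.read("audio/example.wav")

# content_type="env" suits environmental sounds; embedding_size may be 512 or 6144.
emb, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="env", embedding_size=512
)
clip_embedding = emb.mean(axis=0)  # average frame embeddings into one vector per clip
```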

  • Play the audio from the plot in Colab
  • Try out more embedding models (e.g., AST)
  • Review the literature for any video-based transformer models trained on audio

Recently found this audio embedding model called VGGish, pre-trained by Google ([1], [2], [3]). Sharing here FWIW.

@Snehil-Shah thanks for sharing, this looks interesting. @Chaithanya512, do try implementing it. I have found a direct PyTorch implementation of it - https://github.com/harritaylor/torchvggish/tree/f70241ba - and you can download the model from PyTorch Hub, which should hopefully simplify things for you.
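
A quick usage sketch based on the torchvggish README; the wav path is a placeholder:

```python
import torch

# Load the pre-trained VGGish model from PyTorch Hub.
model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

# forward() takes a path to a wav file and returns one 128-D embedding
# per ~0.96 s frame; mean-pool them for a clip-level vector.
frames = model.forward("audio/example.wav")
clip_embedding = frames.mean(dim=0)
```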

I am also sharing some other audio embedding models I found.
Note - this is just a dump of things I am finding; no pressure to try out everything, you can evaluate and take a call on what to try or skip.

  1. CLAP: Learning Audio Concepts From Natural Language Supervision - official code, Hugging Face, Microsoft Python library.
  2. AST: Audio Spectrogram Transformer - Hugging Face (see the sketch after this list)
  3. AI4Bharat IndicWav2Vec - this model is specific to speech, though (do read up more on this) - official code
  4. HuBERT - Hugging Face
  5. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - Hugging Face, sample code
  6. L3-Net Deep Audio Embeddings - code - this looks built on / similar to the VGGish model that @Snehil-Shah shared.
  7. COLA: Contrastive learning of general purpose audio representations - blog, code, pytorch code
  8. SurfPerch - more for environmental sounds, from a cursory reading.
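
For the Hugging Face models (e.g., AST from item 2), extracting a clip-level embedding can look roughly like this. The checkpoint name is from the linked model card; the audio path and mean-pooling choice are my assumptions:

```python
import torch
import torchaudio
from transformers import AutoProcessor, ASTModel

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
processor = AutoProcessor.from_pretrained(ckpt)
model = ASTModel.from_pretrained(ckpt).eval()

waveform, sr = torchaudio.load("audio/example.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
embedding = hidden.mean(dim=1).squeeze(0)       # mean-pool into one 768-D vector
```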

Let me know if I can help out in any other way!

sure, thank you for sharing the embedding models. I will look into these.

  • Let's keep trying out more transformer-based / CNN-ensemble-based models
  • Evaluate clustering results (see the sketch after this list)
  • Review the literature on other clustering algorithms
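
For evaluating clusterings beyond visual inspection, one option is a silhouette-score sketch like this, assuming `embeddings` is the (N, D) matrix from any of the models above; the file name and cluster count are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.load("embeddings.npy")

kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(embeddings)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
print("silhouette:", silhouette_score(embeddings, cluster_ids))
```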

Updated Notebook: https://colab.research.google.com/drive/1lBrWCyUsuCSTOEUUqDwfc6FzpQWO0ETt?usp=sharing

  • Implemented KMeans clustering on AudioVec Operator embeddings.
  • Evaluated the Audio Spectrogram Transformer (AST), VGGish, and HuBERT.

Comments: AST performs better as of now; the AudioVec Operator with KMeans also shows good results.