tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/


Use AudioVec Operator for Clustering Audio

aatmanvaidya opened this issue

  • Setup Feluda
  • Convert video files to audio (a conversion sketch follows this list)
  • Try out Feluda AudioVec (or something you like) and use t-SNE (or other approaches) to evaluate clustering visually
    • can do this in a jupyter notebook and show results
  • Look at other pre-trained models to convert audio files to embeddings.
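
For the video-to-audio step, a minimal sketch assuming ffmpeg is installed and on PATH; the folder names are placeholders:

```python
# Batch-convert videos to 16 kHz mono WAV with ffmpeg.
import subprocess
from pathlib import Path

VIDEO_DIR = Path("videos")  # hypothetical input folder
AUDIO_DIR = Path("audio")   # hypothetical output folder
AUDIO_DIR.mkdir(exist_ok=True)

for video in VIDEO_DIR.glob("*.mp4"):
    out = AUDIO_DIR / (video.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video),
         "-vn",           # drop the video stream
         "-ac", "1",      # downmix to mono
         "-ar", "16000",  # resample to 16 kHz
         str(out)],
        check=True,
    )
```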

assigning this issue to @Chaithanya512

@Chaithanya512 can you comment on the issue so we can assign it to you?

Hi, thank you for the information.

  • Setup Feluda and run tests for AudioVecEmbedding Operator
  • Collect a dataset of 150-200 Audio Files
  • Run the Feluda AudioVec Operator on the dataset, reduce dimensions using t-SNE, and make a visual plot (see the sketch after this list) - this will act as a baseline for us
  • Try out different Embedding Models
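
For the baseline plot, something like this sketch should work, assuming the operator's embeddings are collected into an (N, D) array; the file names are placeholders:

```python
import numpy as np
import plotly.express as px
from sklearn.manifold import TSNE

embeddings = np.load("audiovec_embeddings.npy")       # hypothetical dump of operator output
labels = np.load("labels.npy", allow_pickle=True)     # class name per file

# Project the high-dimensional embeddings down to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels,
                 title="AudioVec embeddings (t-SNE)")
fig.show()
```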

Hi,

Status:

  • Set up Feluda and ran tests for the AudioVec Operator.
  • Curated a dataset by selecting 200 audio files from the environmental dataset. The original dataset contains 50 classes with 40 audio files each; I selected 10 random classes and gathered 20 files from each (10×20 = 200), plus 114 audio files generated from YouTube videos, for a total of 314 audio files.
  • Tested the AudioVec Operator on the dataset, reduced dimensions using t-SNE, and visualized the result using Plotly. Also tested the original dataset (all 2000 files) with the AudioVec Operator.
  • Tried out the OpenL3 embedding model (see the sketch after the links below).

Google Colab Link
Dataset
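
For reference, the OpenL3 step can look roughly like this, using the openl3 package's get_audio_embedding API; the file path is a placeholder:

```python
import soundfile as sf
import openl3

audio, sr = sf.read("audio/example.wav")

# content_type="env" suits environmental sounds; embedding_size may be 512 or 6144.
emb, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="env", embedding_size=512
)
clip_embedding = emb.mean(axis=0)  # average frame embeddings into one vector per clip
```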

  • Play the audio from the plot in Colab
  • Try out more embedding models (e.g., AST)
  • Review the literature for any video-based transformer models trained on audio

Recently found this audio embedding model called VGGish, pre-trained by Google ([1], [2], [3]). Sharing here FWIW.

@Snehil-Shah thanks for sharing, this looks interesting. @Chaithanya512, do try implementing it. I have found a direct PyTorch implementation of it - https://github.com/harritaylor/torchvggish/tree/f70241ba - and you can download the model from PyTorch Hub, which should hopefully simplify things for you.
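
A quick usage sketch based on the torchvggish README; the wav path is a placeholder:

```python
import torch

# Load the pre-trained VGGish model from PyTorch Hub.
model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

# forward() takes a path to a wav file and returns one 128-D embedding
# per ~0.96 s frame; mean-pool them for a clip-level vector.
frames = model.forward("audio/example.wav")
clip_embedding = frames.mean(dim=0)
```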

I am also sharing some other audio embedding models I found.
Note - this is just a dump of things I am finding; no pressure to try out everything, you can evaluate and take a call on what to try or skip.

  1. CLAP: Learning Audio Concepts From Natural Language Supervision - official code, Hugging Face, Microsoft Python library.
  2. AST: Audio Spectrogram Transformer - Hugging Face (see the sketch after this list)
  3. AI4Bharat IndicWav2Vec - this model is specific to speech, though (do read up more on this) - official code
  4. HuBERT - Hugging Face
  5. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - Hugging Face, sample code
  6. L3-Net Deep Audio Embeddings - code - this looks built on / similar to the VGGish model that @Snehil-Shah shared.
  7. COLA: Contrastive learning of general purpose audio representations - blog, code, pytorch code
  8. SurfPerch - more for environmental sounds, from a cursory reading.
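
For the Hugging Face models (e.g., AST from item 2), extracting a clip-level embedding can look roughly like this. The checkpoint name is from the linked model card; the audio path and mean-pooling choice are my assumptions:

```python
import torch
import torchaudio
from transformers import AutoProcessor, ASTModel

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
processor = AutoProcessor.from_pretrained(ckpt)
model = ASTModel.from_pretrained(ckpt).eval()

waveform, sr = torchaudio.load("audio/example.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
embedding = hidden.mean(dim=1).squeeze(0)       # mean-pool into one 768-D vector
```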

Let me know if I can help out in any other way!

sure, thank you for sharing the embedding models. I will look into these.

  • Let's keep trying out more transformer-based / CNN-ensemble-based models
  • Evaluate clustering results (see the sketch after this list)
  • Review the literature on other clustering algorithms
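
For evaluating clusterings beyond visual inspection, one option is a silhouette-score sketch like this, assuming `embeddings` is the (N, D) matrix from any of the models above; the file name and cluster count are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.load("embeddings.npy")

kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(embeddings)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
print("silhouette:", silhouette_score(embeddings, cluster_ids))
```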

Updated Notebook: https://colab.research.google.com/drive/1lBrWCyUsuCSTOEUUqDwfc6FzpQWO0ETt?usp=sharing

  • Implemented KMeans clustering on AudioVec Operator embeddings.
  • Evaluated the Audio Spectrogram Transformer (AST), VGGish, and HuBERT.

Comments: AST performs better as of now; the AudioVec Operator with KMeans also shows good results.