tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/

Evaluate similarity or clustering algorithms for audio files

aatmanvaidya opened this issue · comments

  • Improve RAM usage for the operator; see this issue for reference: #321
  • Evaluate similarity or clustering algorithms for audio
  • Evaluate whether a sampling strategy helps improve performance

assigning this issue to @Chaithanya512

  • try out k-means clustering on audio data using Feluda's AudioVec embedding operator
  • run k-means clustering on CLAP, AST, CED and Feluda AudioVec embeddings (see the sketch after this list)
  • profile CLAP and AST for RAM and CPU usage
  • read about and implement sampling strategies
  • start adding more and
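
A minimal sketch of how k-means could be run on precomputed embeddings. It assumes the embeddings are already available as a NumPy array; scikit-learn's `KMeans` and the silhouette score are my choices here, not necessarily what Feluda will end up using:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# `embeddings` stands in for the (n_clips, dim) array produced by any of the
# operators (Feluda AudioVec, CLAP, AST, CED); random data is used here only
# so the sketch runs on its own.
embeddings = np.random.rand(200, 512).astype(np.float32)

# L2-normalise so that Euclidean k-means roughly tracks cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Silhouette score gives one model-agnostic way to compare how cleanly
# each embedding space separates the clips.
print("silhouette score:", silhouette_score(embeddings, labels))
```

Running the same clustering over embeddings from each model would give a rough, like-for-like comparison across the shortlisted operators.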

Notes on sampling strategies / pre-processing for audio

Will keep adding more things to this comment!

Hi,

Thank you for the notes. The default CLAP (their best variant) uses the HTS-AT model; both HTS-AT and the AST model take mel-spectrograms as input. However, CLAP has an interesting way of handling variable audio lengths:

> we train and inference on different lengths of audio inputs in constant computation time by combining both coarsely global and randomly sampled local information. For an audio in T seconds and a fixed chunk duration d = 10 seconds:
> • T ≤ d: we first repeat the input, then pad it with zero values. For example, a 3-second input will be repeated as 3 × 3 = 9-second and padded with 1-second zero values.
> • T > d: we first downsample the input from T to d-second as a global input. Then we randomly slice three d-second clips, respectively from the front 1/3, middle 1/3 and back 1/3 of the input, as local inputs.

These local and global inputs are then passed through convolution layers, followed by a feature-fusion step.
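
A rough sketch of that chunking logic, based only on my reading of the quoted description (not CLAP's actual code; the global input here is produced by simple index striding rather than proper resampling, and the waveform is assumed to be a mono tensor of shape `(1, num_samples)`):

```python
import random
import torch
import torch.nn.functional as F

def clap_style_inputs(waveform: torch.Tensor, sample_rate: int, chunk_sec: int = 10):
    """Return (global_input, local_inputs) following the quoted scheme."""
    chunk_len = chunk_sec * sample_rate
    length = waveform.shape[-1]

    if length <= chunk_len:
        # T <= d: repeat the clip as many times as fits, then zero-pad to d seconds.
        repeats = max(1, chunk_len // length)
        repeated = waveform.repeat(1, repeats)[..., :chunk_len]
        padded = F.pad(repeated, (0, chunk_len - repeated.shape[-1]))
        return padded, [padded]

    # T > d: shrink the whole clip to d seconds as the coarse global input
    # (index striding here; the paper uses proper downsampling).
    idx = torch.linspace(0, length - 1, chunk_len).long()
    global_input = waveform[..., idx]

    # Randomly slice one d-second clip each from the front, middle and back thirds.
    third = length // 3
    local_inputs = []
    for i in range(3):
        lo = i * third
        hi = max(lo, min((i + 1) * third, length) - chunk_len)
        start = min(random.randint(lo, hi), length - chunk_len)
        local_inputs.append(waveform[..., start:start + chunk_len])
    return global_input, local_inputs
```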

We can discuss all the shortlisted models further once I profile them. Also, could you please elaborate on 'extracting key features instead of key frames'? Do you mean feature-based sampling, where we extract certain features and apply a threshold to select audio frames, or simply extracting the features?

Yes, I meant feature-based sampling: just extracting the features of an audio file. But we will have to investigate whether the models can then extract embeddings from them.
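
For reference, extracting features (rather than raw frames) from an audio file might look something like this with librosa; this is only a sketch, and whether the shortlisted models can consume such features directly is exactly the open question above:

```python
import librosa

def extract_audio_features(path: str, sr: int = 16000):
    """Load an audio file and return a few common frame-level features."""
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Log-mel spectrogram: the input representation CLAP / AST / CED style
    # models typically consume.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)

    # MFCCs and spectral centroid: compact summaries that could be thresholded
    # to decide which frames to keep in a feature-based sampling scheme.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

    return {"log_mel": log_mel, "mfcc": mfcc, "spectral_centroid": centroid}
```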

AudioVec Operator

| Duration | Time Taken | Memory Used |
| --- | --- | --- |
| 5 Sec | 0.140 Sec | 796.3 MiB |
| 10 Sec | 0.227 Sec | 796.3 MiB |
| 30 Sec | 0.587 Sec | 796.3 MiB |
| 1 Min | 1.103 Sec | 796.3 MiB |
| 2 Min | 2.198 Sec | 854.7 MiB |
| 5 Min | 5.273 Sec | 1.2 GiB |
| 10 Min | 10.349 Sec | 1.9 GiB |
| 30 Min | 32.839 Sec | 4.8 GiB |
| 1 Hour | 68.933 Sec | 9.0 GiB |
| 2 Hour | - | - |
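
The exact harness used for these numbers isn't shown in the thread; a minimal sketch of how per-clip time and resident memory could be measured (assuming psutil, with `extract_fn` as a placeholder wrapper around an operator's embedding call):

```python
import os
import time

import psutil

def profile_one_clip(extract_fn, audio_path):
    """Time a single embedding call and report the process's resident memory."""
    process = psutil.Process(os.getpid())

    start = time.perf_counter()
    embedding = extract_fn(audio_path)   # e.g. a wrapper around the operator's run method
    elapsed = time.perf_counter() - start

    rss_mib = process.memory_info().rss / (1024 ** 2)
    return embedding, elapsed, rss_mib
```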

Contrastive Language Audio Pretraining (CLAP)

| Duration | Time Taken | Memory Used |
| --- | --- | --- |
| 5 Sec | 0.262 Sec | 2.7 GiB |
| 10 Sec | 0.264 Sec | 2.7 GiB |
| 30 Sec | 0.265 Sec | 2.7 GiB |
| 1 Min | 0.265 Sec | 2.7 GiB |
| 2 Min | 0.268 Sec | 2.7 GiB |
| 5 Min | 0.258 Sec | 2.7 GiB |
| 10 Min | 0.276 Sec | 2.7 GiB |
| 30 Min | 0.476 Sec | 2.7 GiB |
| 1 Hour | 0.960 Sec | 3.8 GiB |
| 2 Hour | 2.511 Sec | 6.7 GiB |

Audio Spectrogram Transformer (AST)

| Duration | Time Taken | Memory Used |
| --- | --- | --- |
| 5 Sec | 1.177 Sec | 1.1 GiB |
| 10 Sec | 1.174 Sec | 1.1 GiB |
| 30 Sec | 1.166 Sec | 1.1 GiB |
| 1 Min | 1.175 Sec | 1.1 GiB |
| 2 Min | 1.158 Sec | 1.1 GiB |
| 5 Min | 1.137 Sec | 1.1 GiB |
| 10 Min | 1.167 Sec | 1.1 GiB |
| 30 Min | 1.204 Sec | 2.1 GiB |
| 1 Hour | 1.211 Sec | 3.4 GiB |
| 2 Hour | 2.129 Sec | 6.1 GiB |

Consistent Ensemble Distillation (CED)

| Duration | Time Taken | Memory Used | Time Taken (after sampling frames) | Memory Used (after sampling frames) |
| --- | --- | --- | --- | --- |
| 5 Sec | 0.935 Sec | 324.2 MiB | 0.028 Sec | 303.8 MiB |
| 10 Sec | 0.937 Sec | 380.9 MiB | 0.053 Sec | 322.9 MiB |
| 30 Sec | 1.026 Sec | 391.5 MiB | 0.174 Sec | 324.7 MiB |
| 1 Min | 0.941 Sec | 412.4 MiB | 0.181 Sec | 326.6 MiB |
| 2 Min | 0.940 Sec | 459.5 MiB | 0.165 Sec | 330.1 MiB |
| 5 Min | 1.635 Sec | 597.7 MiB | 0.171 Sec | 341.3 MiB |
| 10 Min | 3.315 Sec | 746.4 MiB | 0.192 Sec | 359.4 MiB |
| 30 Min | 10.011 Sec | 1.4 GiB | 0.234 Sec | 432.6 MiB |
| 1 Hour | 19.411 Sec | 3.8 GiB | 0.300 Sec | 542.6 MiB |
| 2 Hour | 38.927 Sec | 6.7 GiB | 0.445 Sec | 762.2 MiB |
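
The "after sampling frames" columns imply the input is subsampled before embedding; as a hedged sketch of one such strategy (uniformly keeping a fixed number of short frames from a long mono clip; this is an assumption about the approach, not a description of the actual code used for the table):

```python
import numpy as np

def sample_frames(waveform: np.ndarray, sr: int, frame_sec: float = 1.0, max_frames: int = 30):
    """Uniformly keep at most `max_frames` short frames from a long mono clip."""
    frame_len = int(frame_sec * sr)
    n_frames = len(waveform) // frame_len
    if n_frames <= max_frames:
        return waveform  # short clip: keep everything

    # Evenly spaced frame indices spanning the whole clip.
    keep = np.linspace(0, n_frames - 1, max_frames).astype(int)
    frames = [waveform[i * frame_len:(i + 1) * frame_len] for i in keep]
    return np.concatenate(frames)
```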