tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/

Evaluate similarity or clustering algorithms for audio files

aatmanvaidya opened this issue · comments

  • Improve RAM usage for the operator; see this issue for reference: #321
  • Evaluate similarity or clustering algorithms for audio
  • Evaluate whether a sampling strategy helps improve performance

assigning this issue to @Chaithanya512

  • try out k-means clustering on audio data using Feluda's AudioVec embedding operator
  • run k-means clustering on CLAP, AST, CED and Feluda AudioVec embeddings (see the sketch after this list)
  • profile CLAP and AST for RAM and CPU usage
  • read about and implement sampling strategies
  • start adding more and
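
A minimal sketch of how k-means could be run on precomputed embeddings. It assumes the embeddings are already available as a NumPy array; scikit-learn's `KMeans` and the silhouette score are my choices here, not necessarily what Feluda will end up using:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# `embeddings` stands in for the (n_clips, dim) array produced by any of the
# operators (Feluda AudioVec, CLAP, AST, CED); random data is used here only
# so the sketch runs on its own.
embeddings = np.random.rand(200, 512).astype(np.float32)

# L2-normalise so that Euclidean k-means roughly tracks cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Silhouette score gives one model-agnostic way to compare how cleanly
# each embedding space separates the clips.
print("silhouette score:", silhouette_score(embeddings, labels))
```

Running the same clustering over embeddings from each model would give a rough, like-for-like comparison across the shortlisted operators.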

Notes on sampling strategies / pre-processing for audio

Will keep adding more things to this comment!

Hi,

Thank you for the notes. The default CLAP (their best variant) uses the HTS-AT model; both HTS-AT and the AST model take mel-spectrograms as input. However, CLAP has an interesting way of handling variable audio lengths:

> we train and inference on different lengths of audio inputs in constant computation time by combining both coarsely global and randomly sampled local information. For an audio in T seconds and a fixed chunk duration d = 10 seconds:
> • T ≤ d: we first repeat the input, then pad it with zero values. For example, a 3-second input will be repeated as 3 × 3 = 9-second and padded with 1-second zero values.
> • T > d: we first downsample the input from T to d-second as a global input. Then we randomly slice three d-second clips, respectively from the front 1/3, middle 1/3 and back 1/3 of the input, as local inputs.

These local and global inputs are then passed through convolution layers, followed by a feature-fusion step.
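
A rough sketch of that chunking logic, based only on my reading of the quoted description (not CLAP's actual code; the global input here is produced by simple index striding rather than proper resampling, and the waveform is assumed to be a mono tensor of shape `(1, num_samples)`):

```python
import random
import torch
import torch.nn.functional as F

def clap_style_inputs(waveform: torch.Tensor, sample_rate: int, chunk_sec: int = 10):
    """Return (global_input, local_inputs) following the quoted scheme."""
    chunk_len = chunk_sec * sample_rate
    length = waveform.shape[-1]

    if length <= chunk_len:
        # T <= d: repeat the clip as many times as fits, then zero-pad to d seconds.
        repeats = max(1, chunk_len // length)
        repeated = waveform.repeat(1, repeats)[..., :chunk_len]
        padded = F.pad(repeated, (0, chunk_len - repeated.shape[-1]))
        return padded, [padded]

    # T > d: shrink the whole clip to d seconds as the coarse global input
    # (index striding here; the paper uses proper downsampling).
    idx = torch.linspace(0, length - 1, chunk_len).long()
    global_input = waveform[..., idx]

    # Randomly slice one d-second clip each from the front, middle and back thirds.
    third = length // 3
    local_inputs = []
    for i in range(3):
        lo = i * third
        hi = max(lo, min((i + 1) * third, length) - chunk_len)
        start = min(random.randint(lo, hi), length - chunk_len)
        local_inputs.append(waveform[..., start:start + chunk_len])
    return global_input, local_inputs
```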

We can discuss all the shortlisted models further once I profile them. Also, could you please elaborate on 'extracting key features instead of key frames'? Do you mean feature-based sampling, where we extract certain features and apply a threshold to select audio frames, or simply extracting the features?

Yes, I meant feature-based sampling: just extracting the features of an audio file. But we will have to investigate whether the models can then extract embeddings from them.
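
For reference, extracting features (rather than raw frames) from an audio file might look something like this with librosa; this is only a sketch, and whether the shortlisted models can consume such features directly is exactly the open question above:

```python
import librosa

def extract_audio_features(path: str, sr: int = 16000):
    """Load an audio file and return a few common frame-level features."""
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Log-mel spectrogram: the input representation CLAP / AST / CED style
    # models typically consume.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)

    # MFCCs and spectral centroid: compact summaries that could be thresholded
    # to decide which frames to keep in a feature-based sampling scheme.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

    return {"log_mel": log_mel, "mfcc": mfcc, "spectral_centroid": centroid}
```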

AudioVec Operator

| Duration | Time Taken | Memory Used |
| --- | --- | --- |
| 5 Sec | 0.140 Sec | 796.3 MiB |
| 10 Sec | 0.227 Sec | 796.3 MiB |
| 30 Sec | 0.587 Sec | 796.3 MiB |
| 1 Min | 1.103 Sec | 796.3 MiB |
| 2 Min | 2.198 Sec | 854.7 MiB |
| 5 Min | 5.273 Sec | 1.2 GiB |
| 10 Min | 10.349 Sec | 1.9 GiB |
| 30 Min | 32.839 Sec | 4.8 GiB |
| 1 Hour | 68.933 Sec | 9.0 GiB |
| 2 Hour | - | - |
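
The exact harness used for these numbers isn't shown in the thread; a minimal sketch of how per-clip time and resident memory could be measured (assuming psutil, with `extract_fn` as a placeholder wrapper around an operator's embedding call):

```python
import os
import time

import psutil

def profile_one_clip(extract_fn, audio_path):
    """Time a single embedding call and report the process's resident memory."""
    process = psutil.Process(os.getpid())

    start = time.perf_counter()
    embedding = extract_fn(audio_path)   # e.g. a wrapper around the operator's run method
    elapsed = time.perf_counter() - start

    rss_mib = process.memory_info().rss / (1024 ** 2)
    return embedding, elapsed, rss_mib
```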

Contrastive Language Audio Pretraining (CLAP)

| Duration | Time Taken | Memory Used |
| --- | --- | --- |
| 5 Sec | 0.262 Sec | 2.7 GiB |
| 10 Sec | 0.264 Sec | 2.7 GiB |
| 30 Sec | 0.265 Sec | 2.7 GiB |
| 1 Min | 0.265 Sec | 2.7 GiB |
| 2 Min | 0.268 Sec | 2.7 GiB |
| 5 Min | 0.258 Sec | 2.7 GiB |
| 10 Min | 0.276 Sec | 2.7 GiB |
| 30 Min | 0.476 Sec | 2.7 GiB |
| 1 Hour | 0.960 Sec | 3.8 GiB |
| 2 Hour | 2.511 Sec | 6.7 GiB |

Audio Spectrogram Transformer (AST)

| Duration | Time Taken | Memory Used |
| --- | --- | --- |
| 5 Sec | 1.177 Sec | 1.1 GiB |
| 10 Sec | 1.174 Sec | 1.1 GiB |
| 30 Sec | 1.166 Sec | 1.1 GiB |
| 1 Min | 1.175 Sec | 1.1 GiB |
| 2 Min | 1.158 Sec | 1.1 GiB |
| 5 Min | 1.137 Sec | 1.1 GiB |
| 10 Min | 1.167 Sec | 1.1 GiB |
| 30 Min | 1.204 Sec | 2.1 GiB |
| 1 Hour | 1.211 Sec | 3.4 GiB |
| 2 Hour | 2.129 Sec | 6.1 GiB |

Consistent Ensemble Distillation (CED)

| Duration | Time Taken | Memory Used | Time Taken (after sampling frames) | Memory Used (after sampling frames) |
| --- | --- | --- | --- | --- |
| 5 Sec | 0.935 Sec | 324.2 MiB | 0.028 Sec | 303.8 MiB |
| 10 Sec | 0.937 Sec | 380.9 MiB | 0.053 Sec | 322.9 MiB |
| 30 Sec | 1.026 Sec | 391.5 MiB | 0.174 Sec | 324.7 MiB |
| 1 Min | 0.941 Sec | 412.4 MiB | 0.181 Sec | 326.6 MiB |
| 2 Min | 0.940 Sec | 459.5 MiB | 0.165 Sec | 330.1 MiB |
| 5 Min | 1.635 Sec | 597.7 MiB | 0.171 Sec | 341.3 MiB |
| 10 Min | 3.315 Sec | 746.4 MiB | 0.192 Sec | 359.4 MiB |
| 30 Min | 10.011 Sec | 1.4 GiB | 0.234 Sec | 432.6 MiB |
| 1 Hour | 19.411 Sec | 3.8 GiB | 0.300 Sec | 542.6 MiB |
| 2 Hour | 38.927 Sec | 6.7 GiB | 0.445 Sec | 762.2 MiB |
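
The "after sampling frames" columns imply the input is subsampled before embedding; as a hedged sketch of one such strategy (uniformly keeping a fixed number of short frames from a long mono clip; this is an assumption about the approach, not a description of the actual code used for the table):

```python
import numpy as np

def sample_frames(waveform: np.ndarray, sr: int, frame_sec: float = 1.0, max_frames: int = 30):
    """Uniformly keep at most `max_frames` short frames from a long mono clip."""
    frame_len = int(frame_sec * sr)
    n_frames = len(waveform) // frame_len
    if n_frames <= max_frames:
        return waveform  # short clip: keep everything

    # Evenly spaced frame indices spanning the whole clip.
    keep = np.linspace(0, n_frames - 1, max_frames).astype(int)
    frames = [waveform[i * frame_len:(i + 1) * frame_len] for i in keep]
    return np.concatenate(frames)
```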