clustering of audio data using feluda's audio-cnn operator

Question


Chaithanya512 opened this issue 2 months ago · comments

Chaithanya kalyan commented 2 months ago

Related to #82

Check out this notebook. I tried clustering audio data (Indic languages) using Feluda's audio-cnn operator, and I also implemented a basic autoencoder architecture for the same task. The dataset consists of audio clips extracted from YouTube videos using these scripts. The scripts take a YouTube video, break it into small clips, and output the processed .wav files. To represent real-world social media data, I took videos in different Indic languages, such as Assamese, Hindi, Telugu, and Kannada, covering various themes such as education, entertainment, motivation, politics, horoscope/devotion, and business.
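For readers who want a feel for the clip-splitting step, here is a minimal sketch. It is not the linked scripts, which are not reproduced in this thread; the function name `split_into_clips`, the 10-second clip length, and the 16 kHz sample rate are all assumptions made for illustration. It only shows the idea of cutting one long recording into fixed-length pieces.

```python
import numpy as np

def split_into_clips(samples, sample_rate, clip_seconds=10.0, drop_last=True):
    """Split a 1-D array of audio samples into fixed-length clips.

    Hypothetical helper for illustration; the clip length is an assumption.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0:
        raise ValueError("clip_seconds * sample_rate must be at least 1 sample")
    # Step through the signal in non-overlapping windows of clip_len samples.
    clips = [samples[i:i + clip_len] for i in range(0, len(samples), clip_len)]
    # Optionally discard a trailing clip that is shorter than the others.
    if drop_last and clips and len(clips[-1]) < clip_len:
        clips.pop()
    return clips

# Example: 25 seconds of silence at an assumed 16 kHz sample rate.
sample_rate = 16000
audio = np.zeros(25 * sample_rate, dtype=np.float32)
clips = split_into_clips(audio, sample_rate, clip_seconds=10.0)
```

Dropping the short trailing clip keeps every clip the same length, which is convenient when the clips are later fed to a model that expects fixed-size input.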

dataset: https://drive.google.com/drive/folders/1kzdQrvNs0cG-9wuKv17m41xdXujoaHWh?usp=sharing

thumbnails: https://drive.google.com/drive/folders/1M8HEmZ654kTdmpcY54g27yg-tEkhz9Pr?usp=sharing

The thumbnails folder also contains the trained autoencoder model.

[Image: t-SNE clustering of the audio data using the audio-cnn operator]

[Image: t-SNE clustering of the audio data using the autoencoder]

I would be happy to hear any feedback and suggestions for improvement. Next, I will research different approaches to obtaining transcriptions of the audio, since they are much needed for efficient and thematic clustering. I have found Indic Subtitler to be a good starting point. I would appreciate any directions.

Aatman Vaidya · Answer 1 · Mon May 06 2024 22:20:56 GMT+0800 (China Standard Time)

Hi @Chaithanya512

This is great work. It is really good to know that, using the audio-cnn model and t-SNE embeddings, you were able to cluster some audio files with good results.

Great effort implementing the autoencoder; it was neat work. Here are my thoughts on why audio clustering was poor with the autoencoder:

- I read your comment in the notebook, and you are right: there could be several reasons, such as insufficient training data or the autoencoder structure.
- But I feel a major reason could be that audio files are generally not converted directly to tensors and trained on a CNN. Instead, an audio file is converted to a spectrogram, and CNNs are then trained on the images of these spectrograms. Tuning hyperparameters such as the audio sampling rate, selecting key frames in the audio, and adding noise to the training data can also help improve the classification results.

Definitely do look into ways of doing efficient and thematic clustering, as that would be great learning for me too.
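To make the spectrogram suggestion above concrete, here is a minimal sketch of turning a waveform into a log-power spectrogram with plain NumPy. It is not Feluda's audio-cnn preprocessing and not the notebook's code; the function name `log_spectrogram`, the 400-sample frame, the 160-sample hop, and the 16 kHz sample rate are assumptions chosen for illustration. In practice a library such as librosa or torchaudio would more likely be used, often with a mel scale on top.

```python
import numpy as np

def log_spectrogram(samples, frame_len=400, hop_len=160, eps=1e-10):
    """Return a log-power spectrogram of shape (freq_bins, n_frames).

    Hypothetical helper for illustration; frame and hop sizes are assumptions.
    """
    samples = np.asarray(samples, dtype=np.float64)
    # Pad very short signals so that at least one full frame exists.
    if len(samples) < frame_len:
        samples = np.pad(samples, (0, frame_len - len(samples)))
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([
        samples[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # Power spectrum of each frame via the real-input FFT.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Convert to decibels; eps avoids taking the log of zero.
    return 10.0 * np.log10(power + eps).T

# Example: one second of a 1 kHz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = log_spectrogram(tone)
```

The resulting 2-D array can be treated like a single-channel image, which is what makes it a natural input for a CNN or a convolutional autoencoder.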

I also wanted to say, do consider submitting a proposal for Clustering large amount of audio. I think you'll appreciate the complexity of this one, and we could also use some focussed, in-depth exploration of this problem as part of this project.

If you have any questions for us, do consider joining our Slack: https://admin417477.typeform.com/to/nVuNyG?typeform-source=tattle.co.in

Chaithanya kalyan · Answer 2 · Wed May 08 2024 22:26:24 GMT+0800 (China Standard Time)

Hi @aatmanvaidya

Sorry for the late response; I was traveling. Thank you for the feedback. You were correct: converting the audio to spectrograms and then clustering showed better results, even with an autoencoder network. Check the updated notebook. I have modified Feluda's audio-cnn operator to integrate an autoencoder architecture. The results were better than running an autoencoder directly on the audio, but still not as good as those from Feluda's operator. Also, can you please send the invitation to join the Slack channel? I would be happy to discuss further approaches and clear up some of my doubts.

[Image: t-SNE clustering of the audio data using autoencoders and spectrograms]

Gmail: chay5522kalyan@gmail.com

Aatman Vaidya · Answer 3 · Thu May 09 2024 00:54:19 GMT+0800 (China Standard Time)

Hi @Chaithanya512

I looked at the updated notebook, and things look neat.

I think that to improve the results with the autoencoder, we will now have to fine-tune many hyperparameters further and possibly make some changes to its structure. We should also explore other approaches apart from autoencoders that can give us comparable results in clustering.

I have sent you the invite link to join our Slack via email.