tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/


[DMP 2024]: Clustering large amount of audio

dennyabrain opened this issue

Ticket Contents

Description

Feluda allows researchers, fact-checkers and journalists to explore and analyze large quantities of multimedia content. One important modality on Indian social media is audio. The scope of this task is to explore various automated techniques suited for grouping similar audio together and visualizing the groups. After consultation with the team, implement an end-to-end workflow that can be used to surface visual or temporal trends in a large collection of audio.

Goals

  • Review the literature with our team, and research and prototype state-of-the-art ML and classical DSP techniques
  • Optimize the solution for consistent RAM and CPU usage (limit the spikes caused by variables like file size, video length etc.), since it will need to scale up to millions of videos
  • Integrate the solution into Feluda by creating an operator that adheres to the Feluda operator interface
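
To make the second goal concrete: one common way to keep memory flat regardless of file length is to read audio in fixed-size chunks instead of loading the whole file. The sketch below uses only Python's standard library and assumes 16-bit mono WAV input. The function names and the choice of RMS as the per-chunk computation are illustrative and not part of Feluda.

```python
import math
import os
import struct
import tempfile
import wave


def rms_per_chunk(path, chunk_frames=16000):
    """Yield the RMS level of each fixed-size chunk of a 16-bit mono WAV file.

    Only `chunk_frames` frames are held in memory at a time, so peak memory
    does not grow with the length of the file.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono WAV")
        while True:
            raw = wav.readframes(chunk_frames)
            if not raw:
                break
            samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
            yield math.sqrt(sum(s * s for s in samples) / len(samples))


def write_test_tone(path, seconds=3, rate=16000, freq=440.0):
    """Write a synthetic sine tone so the example runs without real data."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        frames = bytearray()
        for n in range(seconds * rate):
            value = int(10000 * math.sin(2 * math.pi * freq * n / rate))
            frames += struct.pack("<h", value)
        wav.writeframes(bytes(frames))


# Usage: a 3-second tone read in 1-second chunks gives three RMS values.
tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(tmp_path)
levels = list(rms_per_chunk(tmp_path))
print(levels)
```

In a real operator the per-chunk step would be whatever feature or embedding computation is chosen. The point is only that the generator never holds more than one chunk at a time.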

Expected Outcome

Feluda's goal is to provide a simple CLI or scriptable interface for analysing multimodal social media data. In that vein, all the work that you do should be executable and configurable via scripts and config files. The solution should look at Feluda's architecture and its various components to identify the best ways to enable this.
The solution should provide a way to configure the data source (a database with file IDs or an S3 bucket with files), to specify and implement the data processing pipeline, and to choose where the result will be stored. Our current implementation uses S3 and an SQL database as data sources and Elasticsearch for storing results, but additional sources or stores can be added if apt for this project.
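
As an illustration of what "configurable via config files" could look like, here is a minimal sketch that parses a JSON document into a typed config object. Every key and field name in it (`source`, `operators`, `store`, `PipelineConfig`, the bucket and index names) is hypothetical and does not reflect Feluda's actual configuration schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PipelineConfig:
    # All field names here are hypothetical, not Feluda's actual schema.
    source_type: str            # e.g. "s3" or "sql"
    source_location: str        # bucket name or database URL
    operators: list = field(default_factory=list)
    store_type: str = "elasticsearch"
    store_index: str = "audio_embeddings"


def load_config(text):
    """Parse a JSON document into a PipelineConfig, validating the source type."""
    raw = json.loads(text)
    if raw["source"]["type"] not in {"s3", "sql"}:
        raise ValueError("unsupported source type: %s" % raw["source"]["type"])
    store = raw.get("store", {})
    return PipelineConfig(
        source_type=raw["source"]["type"],
        source_location=raw["source"]["location"],
        operators=list(raw.get("operators", [])),
        store_type=store.get("type", "elasticsearch"),
        store_index=store.get("index", "audio_embeddings"),
    )


EXAMPLE = """
{
  "source": {"type": "s3", "location": "example-audio-bucket"},
  "operators": ["audio_vec_embedding", "cluster"],
  "store": {"type": "elasticsearch", "index": "audio_clusters"}
}
"""

config = load_config(EXAMPLE)
print(config)
```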

Acceptance Criteria

  • Regular Interactive Demos with the team using a public jupyter notebook pushed to our experiments repository
  • Working feluda operator with tests that can be run as an independent worker in the cloud to schedule processing jobs over a large dataset
  • Output Structured data that can be passed onto a UI service (web or mobile) for downstream use cases

Implementation Details

One way we have approached this is by using vector embeddings. We have done this to great success to surface visual trends in images. We used a ResNet model to generate vector embeddings and stored them in Elasticsearch. We also used t-SNE to reduce the dimensions of the vector embeddings and then display them in a 2D visualization. It can be viewed here
A detailed report over feluda's usage in a project to analyze images can be read here
The relevant feluda operator can be studied here
The code for tsne is here
A prior study of various ways to get insights out of images has been documented here
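
For readers unfamiliar with the reduce-to-2D step, the sketch below shows the general idea using scikit-learn's `TSNE`. It is not the code linked above. Random vectors stand in for real embeddings, the dimension of 512 is chosen arbitrarily, and the perplexity value is just a reasonable setting for 60 points.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for real embeddings: two loose groups of 30 vectors each.
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 512))
group_b = rng.normal(loc=5.0, scale=1.0, size=(30, 512))
embeddings = np.vstack([group_a, group_b])

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points_2d = tsne.fit_transform(embeddings)

# Each row of points_2d is an (x, y) position that can be drawn on a scatter plot.
print(points_2d.shape)
```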

Mockups/Wireframes

This is an interactive visualization of Image clustering done using Feluda.
[Screenshot of the visualization on the Tattle articles page]

Doing UI development or integrating with any UI software is not part of this project but it might help to see what sort of downstream applications we use Feluda for.

Product Name

Feluda

Organisation Name

Tattle

Domain

Open Source Library

Tech Skills Needed

Machine Learning, Python

Mentor(s)

@dennyabrain @duggalsu

Category

Data Science, Machine Learning, Research

Hi there, @dennyabrain , I'm passionate about machine learning and keen on joining this project.

Here's a bit about myself:
I am Madhukesh Singh, currently studying at the National Institute of Technology, Hamirpur, in my third year.

My experience includes working on image processing, computer vision, and object detection in satellite imagery during my internship as an AI developer at DRDO DYSL.AI.

Is there a preferred method for communicating with the mentors? I'm eager to contact you and explore how I can contribute.

Hi @MadhukeshSingh we can use this issue to communicate approaches. If you start concretely implementing something, you can make a new issue specific to your approach and we can take the conversation there.

"Hi there, @dennyabrain! I want to contribute to this project, but I am new to open-source contribution.
So, can you tell me what I have to do in this project and how to contribute?"

Hi there, @dennyabrain, I'm passionate about machine learning and keen on joining this project. I believe I am a good fit because of a robust skill set encompassing advanced machine learning and natural language processing capabilities.
My adaptability, efficiency in information retrieval, and quick learning make me a valuable asset for tasks requiring machine learning, AI-driven insights, data analysis, and language-related applications.
I am equipped to contribute to the team's goal by leveraging cutting-edge AI technology and staying abreast of industry trends.

Here's a bit about myself:
I am Manisha Sharma, currently in my fourth and final year at GD Goenka University, Gurugram, Haryana.

My experience includes working on deep learning, machine learning and artificial neural network, and artificial crypt analysis during my internship as an AI developer at Sag - DRDO and currently working in Interglobe Aviation as a data analyst internship.

Is there a preferred method for communicating with the mentors? I'm eager to contact you and explore how I can contribute.

Hi @dennyabrain,
I'm Pooja Singh, a software developer intern at Verana Networks, passionate about machine learning and eager to join this project. My robust skill set in advanced machine learning and natural language processing, coupled with my adaptability and efficiency in information retrieval, make me a valuable asset.

I have hands-on experience in machine learning, and artificial intelligence from my internship as well as current work. I'm keen to connect with mentors to explore how I can contribute to the team's goals.

What's the preferred method for communication? Looking forward to hearing from you.

Hello @dennyabrain ,
I'm thrilled to delve into the Feluda project and its objectives. After reviewing the documentation, I noticed that my background aligns well with the project's needs.

A little about myself: My name is Sreyash Layek, and I'm currently in my fifth year at the Indian Institute of Technology, Kharagpur, pursuing a Dual Degree (Integrated B.Tech & M.Tech) with a specialization in Signal Processing and Machine Learning.

Over the past three years, I've dedicated myself to exploring Machine Learning, with a particular focus on Computer Vision and Natural Language Processing tasks. I've spent a year working on Speech Processing and Accent Conversion, achieving results close to the state-of-the-art. Additionally, I've developed models for various applications, including Attention Monitoring, Accident Classification, Audio Classification, Emotion Classification, Recommendation Systems, and more.

I bring to the table over five years of experience in Python and three years in Machine Learning and Deep Learning. I'm eager to learn more about the project and discuss how I can contribute. I'd be interested in understanding your expectations and the specific requirements for this project.

Could we explore this further?

Hello @dennyabrain, my name is Surjeet Bijarniya. I am a student at IIT BHU, passionate about machine learning and eager to join this project. I am new to machine learning, so please tell me how I can contribute.

Hello @dennyabrain! I'm enthusiastic about machine learning and eager to be part of this project.

Allow me to introduce myself:
I'm Kamera Vamshi, currently I am Pursuing my B.Tech Final year at the National Institute of Technology, Rourkela (NIT Rourkela).

My background involves significant experience in Machine Learning, Python, and Data Analysis. I honed these skills during my internship and Projects.

Could you please advise on the preferred method for reaching out to mentors? I'm keen to connect and discuss how I can contribute to the project.

Hii @dennyabrain ,

I am Akanshu Aich, a third-year BTech student from the International Institute of Information Technology, Bhubaneswar. I am writing to express my interest in contributing to this project as a part of DMP 2024. Having thoroughly reviewed the project, I am impressed by its objectives and see its potential for great impact in industry.

With my background in Backend using Django , MERN with practicing hands on Machine learning and DevOps such as Docker, I believe I can make valuable contributions to Machine learning part . My experience includes several projects like Society-Expenditure Manager using Django, Real Estate using MERN and Info-Finding Tool using Machine Learning(LLM), which I believe align well with the goals of your project.

I am particularly interested in fulfilling the requirements of the project and have some ideas on how to approach it effectively. I am committed to adhering to best practices, contributing high-quality code, and actively collaborating with the project maintainers and community.

I am excited about the opportunity to contribute to "Feluda" and help further its mission. I look forward to discussing potential contributions and how I can best support the project.

Please guide me with procedure and with all your knowledge and experience.

Hello @dennyabrain! I'm enthusiastic about machine learning and eager to be part of this project.

Allow me to introduce myself:
I'm Manav Solkar, currently I am Pursuing my B.Tech second year at Thakur College of Engineering and Technology (TCET).

I really want to be a part of this and hope that your guidance would help me to increase my skillset .

Could you please advise on the preferred method for reaching out to mentors? I'm keen to connect and discuss how I can contribute to the project

Hey @dennyabrain and @duggalsu,
I am interested to work on this project. I have prior experience working on project with similar objectives on the QAnon dataset. You can check out my work with the provided link.

notebook link: https://www.kaggle.com/code/tatwanshjaiswal/dark-web-language-analysis

I would be happy to receive feedback on how to improve it.

Do not ask process related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer instructions listed on Unstop and any further queries can be taken up on our Discord channel titled DMP queries.

Hey @dennyabrain and @duggalsu, I am Ashutosh pursuing B.Tech. in Artificial Intelligence and Data Science from IIT Jodhpur. I am proficient in languages like Python and C++. I have worked on projects related to machine learning and deep learning such as Stock Price Prediction and Voice Controlled Music Recommendation System using Deep Learning.
I am interested to work on this project and apply my skills in the project.

Hi everyone,

Thank you for expressing interest in this issue. Depending on your interests and skills, you can take ANY ONE of the following approaches :

  1. Look at the problem statement and propose your approach
    Remember the main problem statement - Given a large number of audio files, find a way to group identical and similar audio files. This approach would be ideal for anyone who is interested in or studies ML and/or DSP. By thinking about the problem statement, reviewing existing literature on it and proposing your approach here, we would all learn something from it and the mentors should be able to nudge you in the right direction.

  2. Try getting feluda working on your machine
    Feluda is a moderately complex piece of software and has many moving parts. Getting it working on your machine can itself be a challenge. We have a guide on it here. If you are a software developer/tinkerer, this might be a good place to start, because once you have Feluda working locally and can see the various existing functionalities, that might give you an idea of how to proceed.

  3. Recreate our code in a Jupyter notebook or Google Colab notebook
    We already have some code that takes audio files and converts them into vectors. We also have code that takes these vectors and clusters them. I would take this approach if you are a software engineer with some ML engineering skills and you know your way around using ML models. Once you get this working on your notebook we can try out different pretrained models to evaluate performance.
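
To make the problem statement in approach 1 more tangible: once each audio file has been turned into a vector, "identical or similar" can be approximated by a cosine-similarity threshold. The following is a minimal NumPy sketch on toy 3-dimensional vectors. The function name and the 0.95 threshold are assumptions for illustration, not values used by Feluda.

```python
import numpy as np


def similar_pairs(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Only look above the diagonal so each pair is reported once.
    upper = np.triu(np.ones(similarity.shape, dtype=bool), k=1)
    rows, cols = np.where(upper & (similarity >= threshold))
    return list(zip(rows.tolist(), cols.tolist()))


vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],   # almost identical to row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.05],   # close to row 2
])
pairs = similar_pairs(vectors)
print(pairs)
```

This builds the full pairwise similarity matrix, so memory grows quadratically with the number of files. At the scale mentioned in the ticket an approximate nearest-neighbour index would be the more realistic choice.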

You'll have me or members of our team to guide you if you get stuck on any of these approaches. Taking some concrete steps on any of these 3 approaches would help us know what your interests and skills are and let us give you concrete feedback when you get stuck.

All the best!

Hello @dennyabrain, I really want to contribute to this project. I have good hands-on experience with Python, machine learning, databases and deep learning. I am a Data Science student and really enthusiastic about working on your project.
Over the past 3 years, I have done a lot of real-time projects, and I have also done many internships to gain hands-on experience.
I want to learn and gain deeper experience by working on this project. Please allow me to work on your project.

Hello @dennyabrain,

I'm eager to contribute to your project. With substantial experience in Python, machine learning, databases, and deep learning, I believe I can make valuable contributions. As a data science student, I've spent the past three years working on various real-world projects and completing internships to hone my skills.
I'm enthusiastic about delving deeper into the field and gaining practical experience through involvement in your project. I'm eager to learn and collaborate effectively. Please consider allowing me to be part of your team.

Do not ask process related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer instructions listed on Unstop and any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

hi @dennyabrain,

I am Chaithanya Kalyan. I am interested in contributing to this project.

I have experience working with time series signals. As part of the PhysioNet 2023 challenge, time domain and frequency domain features were extracted to classify the EEG signals (more details here).

I have a doubt regarding the details of this project and would greatly appreciate the clarification:

  1. Does this clustering algorithm have to be scalable to different datasets (like a general framework that can be extended ) or is it only for a specific dataset?

I think the following approach will be worth trying:
without extracting the traditional audio features, we can train an autoencoder network on a large audio collection to automatically learn a low-level representation of the audio signals and cluster based on these latent representations.
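
A very small sketch of the idea, under heavy simplifying assumptions: the autoencoder here is purely linear, trained with plain gradient descent in NumPy, and the "audio" is synthetic 20-dimensional data generated from 3 hidden factors. A real experiment would use a non-linear network on spectrogram frames. The sketch only shows how a learned bottleneck yields vectors that can be passed to a clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for audio feature frames: 200 samples, 20 dimensions,
# generated from 3 hidden factors plus a little noise.
latent_true = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
data = latent_true @ mixing + 0.01 * rng.normal(size=(200, 20))

# Linear autoencoder: encode 20 -> 3, decode 3 -> 20.
w_enc = 0.1 * rng.normal(size=(20, 3))
w_dec = 0.1 * rng.normal(size=(3, 20))
learning_rate = 0.01
losses = []

for step in range(500):
    code = data @ w_enc              # latent representation
    recon = code @ w_dec             # reconstruction of the input
    error = recon - data
    losses.append(float(np.mean(error ** 2)))
    # Gradients of the mean squared error with respect to each weight matrix.
    grad_recon = 2.0 * error / error.size
    grad_dec = code.T @ grad_recon
    grad_enc = data.T @ (grad_recon @ w_dec.T)
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc

# (200, 3) latent vectors that could be fed to a clustering algorithm.
latent = data @ w_enc
print(losses[0], losses[-1])
```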

I have tried a similar approach on EEG signals before, you can find that notebook here.

I would be happy to hear your feedback.

contact: chay5522kalyan@gmail.com

Hi @Chaithanya512,

Given that the project's focus is on addressing use cases around online misinformation, the dataset we deal with is usually audio/video found on social media. So it can contain a variety of audio: memes, news clippings, amateur recordings from phones etc.

Is there a quick way to validate if the autoencoder network approach would be suitable for this use case? What is your rationale for preferring that over extracting traditional audio features?

Thank you for the feedback, I am currently working on the code to validate the use of autoencoders.

Compared to traditional, hand-crafted features, autoencoders have the potential to capture a wider range of features. While traditional audio features are valuable, they might miss some subtle patterns in the data that autoencoders can discover.
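
For reference, "traditional" features in this sense are quantities computed directly from the waveform or its spectrum. The sketch below computes two standard ones, zero-crossing rate and spectral centroid, in NumPy on a synthetic 440 Hz tone. It is only meant to show what a hand-crafted feature vector looks like, and the function names are illustrative.

```python
import numpy as np


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))


def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the signal, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate      # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)

# A two-element hand-crafted feature vector for this clip.
features = np.array([
    zero_crossing_rate(tone),
    spectral_centroid(tone, sample_rate),
])
print(features)
```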

I have a follow-up question (might be stupid) for your response, please correct me if I am wrong.

I'm curious, do you think traditional audio features are effective in clustering misinformation and not-misinformation? Do those features vary between misinformation and not-misinformation?

So we won't be using the clusters to classify something as "misinformation" and "not misinformation". We're hoping to use clustering as a way to find a first level of grouping within a large dataset. So most likely the clusters could be something high level like "memes", "amateur-smartphone" etc. If we are lucky we could aspire for thematic labels like "politics", "health" etc.

An example of clustering we did on images is here - https://tattle.co.in/articles/covid-whatsapp-public-groups/t-sne/
The clusters we got then were - Screenshots(Social Media), Screenshots(Other), Medical Supplies, Paper Documents, Religious Imagery etc

Thank you for the clarification. That makes sense now. So, we are using clustering only to find the high-level labels/pseudo labels. I have found this paper that uses labeled data (only text) to categorize misinformation posters or active citizens on social media. It got me thinking: if we could obtain the transcriptions of the audio content (if that is possible), that information could significantly enhance our clustering efforts.

@Chaithanya512 yes, that would certainly help. In fact, when we do clustering for images, we often try to extract any text out of them as a way to get a richer dataset. You can certainly try transcriptions for audio content. One challenge might be that we are dealing with non-English languages and also low-quality audio.

Hey, can I work on this issue? I have worked on speech attenuation in the past, so I am somewhat familiar with the problem statement. Kindly let me know.

Hey !! I Want to work on this

Hi there, @dennyabrain,
I am Ankita Mohan, I am a third-year student at Kalinga Institute of Industrial Technology, Odisha. I'm passionate about machine learning and keen on joining this project. Moreover, I have a deep understanding of clustering algorithms as I have done projects in clustering.
I am eager to contribute and to gain your guidance for the same.

I would definitely like to work on it ☺️

Hi all thanks for your enthusiasm. Please let me know if you have any specific ideas on how you would go about the project.

Please refer to this comment for some suggested ways to move forward #82 (comment)

Hi @dennyabrain ! I'm a third year student from Cummins Pune.

I'm thrilled to join your "Clustering large amount of audio" project and offer my skill set, which includes a strong background in machine learning, deep learning (CNN), NLP, DSP and Python and seems to fit perfectly with what you're looking for.
I'm excited to explore how my expertise can elevate the project. Furthermore, the integration of computer vision along with the ML advancements could lead to a seamlessly automated system.
I'm eager to discuss further avenues where I can make meaningful contributions. Could we schedule a meeting to delve into this in more detail?

My skills in machine learning (computer vision, NLP) and experience with speech processing align well with the Feluda project. I'm a motivated student with 3+ years of Python experience and 2 years in ML/DL. Eager to discuss how I can contribute!

Hi @dennyabrain, I am V Dinesh, a third-year Mechanical student from the Army Institute of Technology, Pune. I'm passionate about machine learning and keen on joining this project. In addition, my expertise in clustering algorithms extends to a profound level, acquired through hands-on experience gained from multiple projects focused specifically on implementing and fine-tuning various clustering techniques. These projects have provided me with a comprehensive understanding of the underlying principles, nuances, and practical applications of clustering algorithms across diverse domains, allowing me to effectively navigate through complex datasets, identify patterns, and extract meaningful insights. I am enthusiastic about contributing my expertise and am eager to receive your guidance in order to further enhance my capabilities in this regard.

Hi @dennyabrain. I am Deep Pandharkar, a second-year Data Science Engineering student from DJ Sanghvi College of Engineering, Mumbai. I have some experience in CV as well as NLP. My passion for ML makes me keen on joining this project. In addition to that, I have worked a lot with vector embeddings as a part of my NLP projects. I also have coding experience in Data Structures and Algorithms. Eager to discuss how I can contribute.

I am Sufia, and I graduated with a B.Tech in CSE. I am a data scientist and also a full stack developer, but I am a fresher: I have only completed 6 months of training in the field and a one-month internship, so I want to do this internship.

Weekly Goals

Week 1

  • Setup Feluda and run tests for AudioVecEmbedding Operator
  • Collect a dataset of 150-200 Audio Files
  • Run Feluda AudioVec Operator on the dataset, reduce dimensions using t-SNE and do a visual plot - This will act as a baseline for us
  • Try out different Embedding Models

Week 2

  • Play the audio from the plot in Colab
  • Try out more embedding models - AST
  • Do a review of lit for any video-based transformer models that are trained on audio
  • Try out k-means clustering on audio data using Feluda's AudioVec Embedding operator
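
The k-means step in the Week 2 goals could be prototyped along these lines. This is a plain NumPy sketch of Lloyd's algorithm run on synthetic vectors. It does not call Feluda's AudioVec operator, and the function names, the 16-dimensional vectors and the cluster count are all made up for illustration.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point (Euclidean distance)."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = assign(points, centroids)
        for cluster in range(k):
            members = points[labels == cluster]
            if len(members) > 0:          # leave an empty cluster's centroid unchanged
                centroids[cluster] = members.mean(axis=0)
    return assign(points, centroids), centroids


rng = np.random.default_rng(1)
# Stand-in for audio embeddings: two well-separated blobs of 40 vectors each.
blob_a = rng.normal(loc=0.0, scale=1.0, size=(40, 16))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(40, 16))
vectors = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(vectors, k=2)
print(labels)
```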

Week 3

  • Keep trying out more transformer-based / CNN-ensemble-based models
  • Evaluate clustering results
  • Do a review of lit for other clustering algorithms

Week 4

  • Keep exploring embedding models
  • Start looking at sampling strategies for audio files

Weekly Learnings and Updates:

Week 1:

  • Set-up feluda and ran tests for AudioVec Operator.

  • Collected a dataset of 314 audio files.

  • Established a baseline evaluation for the performance of AudioVec Operator on the dataset using a visual clustering plot.
    [Image: env_data_audiovec]

  • Tried out OpenL3 embeddings model.
    [Image: env_data_openl3]

Colab file: https://colab.research.google.com/drive/1lBrWCyUsuCSTOEUUqDwfc6FzpQWO0ETt?usp=sharing

Week 2: