tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/

[DMP 2024]: Clustering a large number of videos

dennyabrain opened this issue

Ticket Contents

Description

Feluda allows researchers, fact-checkers, and journalists to explore and analyze large quantities of multimedia content. One important modality on Indian social media is video. The scope of this task is to explore various automated techniques suited to this task and, after consultation with the team, implement an end-to-end workflow that can be used to surface visual or temporal trends in a large collection of videos.

Goals

  • Review literature with our team and do research and prototyping to evaluate state-of-the-art ML and classical DSP techniques
  • Optimize the solution for consistent RAM and CPU usage (limit the spikes caused by variables like file size, video length, etc.), since it will need to scale up to millions of videos
  • Integrate the solution into Feluda by creating an operator that adheres to the Feluda operator interface

Expected Outcome

Feluda's goal is to provide a simple CLI or scriptable interface for analysing multimodal social media data. In that vein, all the work you do should be executable and configurable via scripts and config files. The solution should take Feluda's architecture and its various components into account to identify the best ways to enable this.
The solution should have a way to configure the data source (a database with file IDs or an S3 bucket with files), specify and implement the data processing pipeline, and decide where the result will be stored. Our current implementation uses S3 and a SQL database as data sources and Elasticsearch for storing results, but additional sources or stores can be added if apt for this project.
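
For illustration, a hypothetical sketch of what such a configuration could capture follows; the key names are invented for this example and are not Feluda's actual config schema.

```python
# Hypothetical pipeline configuration, sketched as a Python dict.
# Key names are illustrative only, not Feluda's real schema.
pipeline_config = {
    "source": {  # where the videos come from
        "type": "s3",
        "bucket": "my-video-bucket",
        "index_db": "postgresql://user:pass@host/db",  # database holding file IDs
    },
    "operators": [  # processing steps to run per video
        {"name": "vid_vec_rep_resnet", "parameters": {}},
    ],
    "store": {  # where results are written
        "type": "elasticsearch",
        "host": "http://localhost:9200",
        "index": "video_vectors",
    },
}
```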

Acceptance Criteria

  • Regular interactive demos with the team using a public Jupyter notebook pushed to our experiments repository
  • A working Feluda operator with tests that can be run as an independent worker in the cloud to schedule processing jobs over a large dataset
  • Output structured data that can be passed on to a UI service (web or mobile) for downstream use cases

Implementation Details

One way we have approached this is by using vector embeddings. We have done this to great success to surface visual trends in images. We used a ResNet model to generate vector embeddings and stored them in Elasticsearch. We also used t-SNE to reduce the dimensions of the vector embeddings so we could display them in a 2D visualization. It can be viewed here.
A detailed report on Feluda's usage in a project to analyze images can be read here.
The relevant Feluda operator can be studied here.
The code for t-SNE is here.
A prior study of various ways to get insights out of images has been documented here.
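
As a rough illustration of that image pipeline (not the exact operator code; the model choice, paths, and t-SNE settings are placeholders), the idea looks roughly like this:

```python
# Minimal sketch: ResNet feature vectors for a folder of images,
# reduced to 2D with t-SNE for plotting. Paths and sizes are placeholders.
from pathlib import Path

import torch
from PIL import Image
from sklearn.manifold import TSNE
from torchvision import models, transforms

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # drop the classifier head, keep 512-d features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    vectors = torch.stack([
        model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in sorted(Path("images/").glob("*.jpg"))
    ])

# perplexity must be smaller than the number of samples
points = TSNE(n_components=2, perplexity=30).fit_transform(vectors.numpy())
```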

Mockups/Wireframes

This is an interactive visualization of image clustering done using Feluda.
Doing UI development or integrating with any UI software is not part of this project, but it might help to see what sort of downstream applications we use Feluda for.

Product Name

Feluda

Organisation Name

Tattle

Domain

Open Source Library

Tech Skills Needed

Computer Vision, Docker, Machine Learning, Performance Improvement, Python

Mentor(s)

@dennyabrain @duggalsu

Category

Data Science, Machine Learning

Hey @dennyabrain, I'm Sayan, and I'm interested in contributing to the video analysis project! My skills in computer vision, machine learning, and Python are a great fit. I'm eager to explore video analysis using techniques like vector embeddings.

Proficient in Docker and performance optimization, I can ensure the solution scales efficiently. I value open-source development and look forward to contributing demos.

Is there a way you prefer for me to reach out? I'm looking forward to exploring how I can contribute.

Hi @Sayanjones, we can use this issue to discuss approaches. If you start concretely implementing something, you can make a new issue specific to your approach and we can take the conversation there.

Hi @dennyabrain

I'm Rishav Aich, pursuing my BTech in artificial intelligence and data science from IIT Jodhpur. Being a student of AI, I have done courses on deep learning, machine learning, and AI. I am proficient in C++, Python, and R programming languages. I have a strong background in development, more specifically, backend development. I have used Docker in various projects.

This project completely aligns with my skills. It would be great to contribute to this.

Please advise me on how to get started with the project.

Do not ask process-related questions about how to apply or who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer to the instructions listed on Unstop, and any further queries can be taken up on our Discord channel titled DMP queries.

Hey @dennyabrain, this is Aryan from IIIT Naya Raipur. I am currently pursuing my B.Tech in Data Science and Artificial Intelligence. I have good experience in deep learning, computer vision, and NLP. I've worked on several projects, such as self-driving cars using camera input. I am really excited to work on this project as I feel it is a perfect match for me. Also, I am going to learn Docker in the future.

Hi everyone,

Thank you for expressing interest in this issue. Depending on your interests and skills, you can take ANY ONE of the following approaches:

  1. Look at the problem statement and propose your approach
    Remember the main problem statement - Given a large number of video files, find a way to group identical and similar video files. This approach would be ideal for anyone who is interested in or studies ML and/or DSP. By thinking about the problem statement, reviewing existing literature on it, and proposing your approach here, we would all learn something from it, and the mentors should be able to nudge you in the right direction.

  2. Try getting Feluda working on your machine
    Feluda is a moderately complex piece of software and has many moving parts. Getting it working on your machine can itself be a challenge. We have a guide on it [here](https://github.com/tattle-made/feluda/wiki/Setup-Feluda-Locally). If you are a software developer/tinkerer, this might be a good place to start, because once you have Feluda working locally and can see the various existing functionalities, that might give you an idea of how to proceed.

  3. Recreate our code in a Jupyter notebook or Google Colab notebook
    We already have some code that takes [video files and converts them into vectors](https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py). We also have code that takes these vectors and [clusters them](https://github.com/tattle-made/data-experiments/blob/master/tSNE-clustering.ipynb). I would take this approach if you are a software engineer with some ML engineering skills and you know your way around ML models. Once you get this working in your notebook, we can try out different pretrained models to evaluate performance.
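
To make approach 3 concrete, here is a minimal sketch of the clustering half, assuming per-video vectors have already been computed and saved (the path and cluster count are placeholders):

```python
# Cluster precomputed video vectors with k-means.
# Assumes an (n_videos, dim) array saved at a placeholder path.
import numpy as np
from sklearn.cluster import KMeans

vectors = np.load("video_vectors.npy")
labels = KMeans(n_clusters=10, n_init=10).fit_predict(vectors)
for cluster_id in np.unique(labels):
    print(cluster_id, np.where(labels == cluster_id)[0])  # video indices per cluster
```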

You'll have me or members of our team to guide you if you get stuck on any of these approaches. Taking some concrete steps on any of these three approaches would help us know what your interests and skills are and let us give you concrete feedback when you get stuck.

All the best!

Do not ask process-related questions about how to apply or who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer to the instructions listed on Unstop, and any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

Hey @dennyabrain, I have some queries regarding the project:

  • What will be the length of the videos?
  • Is there any available dataset with pre-defined classes?
  • A video is a combination of audio, images, and text. What should be the most important classification criterion among these?
  • How many classes should there be for classification? Please give some examples.

Hi @Aryankb

  1. Generally, expect the length to be anywhere between 30 seconds and 20 minutes.
  2. Currently we don't have a dataset with pre-defined classes, but feel free to look for such datasets.
  3. To the best of my knowledge, a video is just a series of images, so to answer your question, the most important classification criterion would be the image. Please investigate a bit deeper into this. Also take a look at the third point in @dennyabrain's comment; that is an example of clustering images using a certain type of embedding.
  4. There is no specific number of classes, but think of classes as metadata for these videos in the context of social media. Some examples could be memes, political, health, paper documents, news, etc. These are very broad labels; you can think of some specific ones too.

Hope this helps

Hey @dennyabrain, Mithilesh here. I have experience with and a passion for creating end-to-end, highly scalable computer vision pipelines, and I am working with a young start-up as a machine learning engineer. I led a similar project implementation for one of the largest edTech companies in the world, where we worked on clustering a similar type of video (average length of 10 minutes) and then recommended videos based on user mistakes in tests; there we worked on embedding creation and efficient search algorithms. Apart from this, I led the creation and scaling of a computer-vision-based exam grading tool from 50 users to 4 million users with Docker and AWS, and brought down the running time by 70% over three iterations, which helped the government organize the world's largest AI-graded examination. I am very eager to contribute to this project and make the clustering of videos more efficient and scalable.

Hey @dennyabrain, I am Aryan Kumar Baghel, from IIIT Naya Raipur.
I was exploring ways to extract unique frames from a video. I tried to extract unique keyframes from some videos using ffmpeg (to extract keyframes from the video) and k-means (to extract unique keyframes from the keyframes extracted by ffmpeg), and here are the results:

(We can select one image from each cluster as the representation of that cluster; then we can use image captioning models to generate small captions for each image. Next, we can combine all captions to generate a final caption for the video or use them to classify the video accurately.)

Google Colab Notebook

Video 1 link: https://drive.google.com/file/d/1Qr08m4Bf0JjTszExDLoey2LCqcJjJl3n/view?usp=drive_link
Clusters: [cluster images]

Video 2 link: https://drive.google.com/file/d/1QnupjsK7ILQUYrqlPT2pTdTAzoy8Wi-C/view?usp=drive_link
Clusters: [cluster images]

I'll now be working on ways to cluster the images such that the number of clusters is selected automatically. Please give your reviews and directions for future work.
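
Not the notebook code itself, but one common way to pick the number of clusters automatically is to sweep k and keep the best silhouette score. A rough sketch of the keyframe pipeline described above, with raw thumbnails standing in for proper embeddings (paths, sizes, and the k range are placeholders):

```python
# Extract I-frames (keyframes) with ffmpeg, then cluster them with k-means,
# choosing k automatically via silhouette score. Assumes ffmpeg is installed
# and the video yields at least a handful of keyframes.
import subprocess
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

out_dir = Path("keyframes")
out_dir.mkdir(exist_ok=True)
subprocess.run([
    "ffmpeg", "-i", "video.mp4",
    "-vf", "select=eq(pict_type\\,I)",  # keep only I-frames
    "-vsync", "vfr", str(out_dir / "%04d.jpg"),
], check=True)

# Tiny thumbnails as crude feature vectors; a real run would use embeddings.
frames = np.stack([
    np.asarray(Image.open(p).convert("RGB").resize((32, 32)), dtype=np.float32).ravel()
    for p in sorted(out_dir.glob("*.jpg"))
])

# Pick the k with the best silhouette score instead of fixing it by hand.
best_k = max(
    range(2, min(10, len(frames))),
    key=lambda k: silhouette_score(
        frames, KMeans(n_clusters=k, n_init=10).fit_predict(frames)
    ),
)
labels = KMeans(n_clusters=best_k, n_init=10).fit_predict(frames)
print(best_k, labels)
```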

Hey @dennyabrain ,
I'm Aaradhya Singh, currently a 2nd-year undergrad in computer science and engineering, proficient in C/C++, Python, deep learning, and machine learning, and a researcher and learner of various upcoming technologies and tech stacks. After reading your suggested approaches, I might be able to fine-tune some models, most of which are built upon CNN/RNN architectures, to the required efficiency, and use a pipelined/hierarchical approach to solve the complex problem of classifying or clustering the content. Looking forward to working on it and updating you on my findings.

@Snehil-Shah can you comment here, so I can assign the issue to you?

Weekly Goals

Week 1

  • Create a mixed dataset of 150-200 videos
  • Run the Feluda video operator on the video dataset, reduce dimensions using t-SNE, and make a visual plot - this will act as a baseline for us
  • Embedding models - CLIP, VideoMAE
  • Visualize the embeddings using t-SNE

Week 2

  • Try out more video models for embeddings
  • Try self-supervised clustering algorithms
  • Do a literature review of video pre-processing. Read up on better sampling strategies.

Week 3

  • Run video models on the dataset and visualize using t-SNE; also do k-means clustering
  • Implement sampling strategies for video

Week 4

  • Read and review sampling strategies
  • Explore alternatives - closer frames as input, frames with more motion as input, etc.
  • How to better select the n key frames

Week 5

  • Profile the CLIP embedding model.
  • Keep experimenting with more sampling strategies.
  • Collect a dataset of low-quality videos in Indian languages.

Weekly Learnings & Updates

Week 1

  • Benchmarked popular image embedding models to extract semantic features from video frames and average them into a video vector.
  • Compared pre-trained models like ResNet18, CLIP-ViT-B-32, EfficientNet-B0, and DeiT-medium-16.
  • Encoded around 100 videos from a combined UCF101 subset and a custom dataset of popular topics like memes, nature, and commentary.
  • Plotted t-SNE-reduced vectors to evaluate clustering and visual distribution: [t-SNE plot]
  • Clustered them using k-means and examined each cluster to evaluate clusters and spot outliers: [cluster plot]
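
A minimal sketch of that frame-averaging idea, assuming the Hugging Face CLIP checkpoint named below (the frame count, sampling, and path are placeholders):

```python
# Sample frames from a video, embed each with CLIP, and mean-pool
# into a single video vector.
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(path, n=8):
    """Grab n evenly spaced frames from a video as PIL images."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

with torch.no_grad():
    inputs = processor(images=sample_frames("video.mp4"), return_tensors="pt")
    frame_vecs = model.get_image_features(**inputs)  # (n_frames, 512)
    video_vec = frame_vecs.mean(dim=0)               # one vector per video
```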

Week 2

  • Reviewed literature and ran experiments on video transformers and 3D neural net architectures.
  • Benchmarked various video embedding models on extracting action features and capturing inter-frame information.
  • Compared pre-trained models like I3D-R50, R3D-18, SlowFast-R50, VideoMAE, ViViT and X-CLIP.
  • Ran inference using the above tests by plotting t-SNE reduced vectors and individual clusters.
  • Achieved near-zero outliers and true action recognition that was missing with image embedding models.
  • Successfully clustered 100 videos into 13 classes, each correctly corresponding to the original classes, with just one outlier.
  • Finalized a hybrid approach simulating a multi-stream pipeline to capture the static and active aspects of a video.
  • Notebook containing all benchmarks, experiments, inferences, and opinions mentioned till now for reference.
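
For reference, one possible way to get VideoMAE clip embeddings with the Hugging Face transformers library; the checkpoint name and mean-pooling choice are assumptions, not necessarily what was benchmarked:

```python
# VideoMAE expects a fixed-length clip (16 frames for the base checkpoint).
# Random frames stand in for a real decoded video here.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
inputs = processor(frames, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool patch tokens into one clip-level embedding.
video_vec = outputs.last_hidden_state.mean(dim=1).squeeze(0)
```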

Week 3

  • Explored and implemented zero-shot classification with promising results, utilizing the multi-modal nature of CLIP-based transformer models.
    This meant being able to classify videos into newer classes without any fine-tuning or retraining of a new linear head. It works by relying on vector similarity between video embeddings and text embeddings (made from the class names) in a common vector space; see the sketch after this list.
    [zero-shot classification plots]
    This would allow fact-checkers and researchers to surface visual trends based on labels such as "newspaper", "screenshots", "memes" etc. without any additional training overhead.
  • Benchmarked various clustering algorithms from scikit-learn on efficiency and results by plotting individual clusters.
  • Finalized k-means and agglomerative clustering for when the number of clusters is known, and Affinity Propagation for an unknown number of clusters.
  • Notebook containing all benchmarks, experiments, inferences, and opinions on zero-shot classification and clustering algorithms.
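
As referenced in the first bullet above, a minimal sketch of the zero-shot idea; the labels, checkpoint, and the source of the video vector are illustrative:

```python
# Score a video vector against CLIP text embeddings of candidate class names.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a meme", "a newspaper", "a screenshot", "a nature video"]

with torch.no_grad():
    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    text_vecs = model.get_text_features(**text_inputs)  # (n_labels, 512)

# `video_vec` would be the mean of CLIP frame embeddings (see the Week 1 sketch).
video_vec = torch.randn(512)  # placeholder
scores = torch.nn.functional.cosine_similarity(video_vec.unsqueeze(0), text_vecs)
print(labels[scores.argmax()])
```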