tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Home Page: https://tattle.co.in/products/feluda/


Clustering videos using vector-similarity

Snehil-Shah opened this issue · comments

Related to #81

Description

@dennyabrain I tried clustering around 300 videos (from this dataset) using algorithms from your experiment's repo.

Google colab notebook

I first used your approach of taking 5 frames of a video, extracting their features using the ResNet model, and averaging them to generate the final embedding. Then, using your t-SNE reduction approach, I plotted the thumbnails on a graph:
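
As I understand it, that pipeline boils down to evenly sampling frames and mean-pooling their features. A minimal numpy sketch of that idea (the helper names and the assumption that per-frame 512-d ResNet features are already extracted are mine, not the actual operator code):

```python
import numpy as np

def sample_frame_indices(n_frames: int, n_samples: int = 5) -> np.ndarray:
    # Pick n_samples evenly spaced frame indices across the video.
    return np.linspace(0, n_frames - 1, n_samples).astype(int)

def video_embedding(frame_features: np.ndarray) -> np.ndarray:
    # frame_features: (n_samples, 512) per-frame ResNet features.
    # The video embedding is the mean of the sampled frame embeddings.
    return frame_features.mean(axis=0)

indices = sample_frame_indices(n_frames=300, n_samples=5)
print(list(indices))                    # [0, 74, 149, 224, 299]
features = np.random.rand(5, 512)       # stand-in for real ResNet outputs
embedding = video_embedding(features)
print(embedding.shape)                  # (512,)
```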

[Image: t-SNE plot of video thumbnails]
Observations listed in the notebook

I will be doing some R&D on other ways to extract features from videos, and on using different models in our current approach as well (like CLIP, which I have used before).

I will now be working on setting up Feluda and studying how Feluda operators work. Would appreciate some directions...

@Snehil-Shah This is very good progress :)

I have some general improvements/ideas about this project to share and then specific things about feluda.

General Improvements :

  1. Improving the frame sampling strategy - Our current naive implementation just picks 5 samples from the video, but it would be nice to explore whether this can be made more effective, so that we pick "interesting" frames from the video. We can evaluate existing strategies from shot transition detection techniques and pick something that gives us good results.
  2. Good luck evaluating different models. I'd be curious to see how much of a difference that makes to our results.
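
On point 1, one common family of shot-transition detectors compares intensity histograms of consecutive frames; a large jump suggests a cut. A toy sketch of that idea (the threshold, bin count, and synthetic "video" are all illustrative assumptions, not an evaluated strategy):

```python
import numpy as np

def shot_boundaries(frames: np.ndarray, threshold: float = 0.5) -> list:
    # frames: (n, h, w) grayscale frames with values in [0, 255].
    # Compare normalized intensity histograms of consecutive frames;
    # a large L1 distance between them suggests a shot transition.
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=32, range=(0, 256))
        hist = hist / hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)
        prev_hist = hist
    return boundaries

# Synthetic video: 10 dark frames then 10 bright frames -> one cut at index 10.
frames = np.concatenate([np.full((10, 8, 8), 20), np.full((10, 8, 8), 200)])
print(shot_boundaries(frames))  # [10]
```

Sampling one frame per detected shot (instead of 5 evenly spaced ones) would then give frames that are more likely to be visually distinct.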

Specific Feedback for Feluda

Largely the project is composed of the following components

  • Store: this is where metadata is stored. We currently support Elasticsearch and Postgres.
  • Queue: used when dealing with large volumes of data or in a production setting.
  • Operators: these are the individual processing units that take an input (in our case media items - text, video, image, audio) and perform some operation on it.
  • Workers: these are usually meant to be deployed in some cloud setting and let you use the previous components together. So if I wanted to deploy what you have done above in my app, I would configure my Feluda worker to listen to a queue for new media items; when a new item comes in, I would apply a Feluda operator (say, one that generates a vector embedding using the ResNet model) and store the result in Elasticsearch.
    @aatmanvaidya has diligently documented 3 of our currently maintained workers here - https://github.com/tattle-made/feluda/wiki/Workers which might give you a better idea of how they are used.
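
A hypothetical, in-memory sketch of that worker flow (every class and function name here is an illustrative stand-in, not Feluda's actual API):

```python
from collections import deque

class InMemoryQueue:                      # stands in for a real message queue
    def __init__(self, items):
        self.items = deque(items)
    def poll(self):
        return self.items.popleft() if self.items else None

class InMemoryStore:                      # stands in for Elasticsearch/Postgres
    def __init__(self):
        self.docs = {}
    def save(self, media_id, embedding):
        self.docs[media_id] = embedding

def embedding_operator(media_item):       # stands in for the ResNet operator
    # Dummy "embedding" so the sketch is self-contained.
    return [float(len(media_item["path"]))]

def run_worker(queue, store):
    # The worker drains the queue, runs the operator on each media item,
    # and persists the result in the store.
    while (item := queue.poll()) is not None:
        store.save(item["id"], embedding_operator(item))

queue = InMemoryQueue([{"id": "v1", "path": "a.mp4"},
                       {"id": "v2", "path": "bb.mp4"}])
store = InMemoryStore()
run_worker(queue, store)
print(sorted(store.docs))  # ['v1', 'v2']
```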

Generally, go through the wiki to learn more. This page should be pretty useful for setting up Feluda locally - https://github.com/tattle-made/feluda/wiki/Setup-Feluda-Locally

A quick note about operators. So far our operators work on individual items, but for this project we might, for the first time, be figuring out how to make operators that work on collections. That part would be novel, so feel free to think about how you'd solve it.

Hi @Snehil-Shah , great work!!

I also wanted to ask a couple of things.
I see that you have used the code from Feluda's Video Vector operator to select the 5 key frames, so that's great!

  1. I understand you have used the code from the t-SNE notebook, but I just wanted to know: what happens when we don't reduce the 512-dimension vector representation to 2 dimensions, and instead reduce it to, say, 5-7 dimensions? Can you investigate this and see what other ways there are to go about it, and how results might be improved by such hyperparameter tuning? (I don't have great expertise, hence just asking out of curiosity.)
tsne_embeddings = TSNE(n_components=2, learning_rate=150, perplexity=20, angle=0.2, verbose=2).fit_transform(X)
  2. I read in the observations section that the videos are getting clustered into some groups and that some anomalies are also present. I wanted to ask whether the videos are being clustered into the labels mentioned in the dataset.

@aatmanvaidya

  1. To my knowledge, we can't easily visualize a 5-7 dimensional vector space (let alone 512 dimensions). To visually see the different groups and clusters in the data, we have to plot them in a 2D graph (X-Y space), hence we reduce to 2 dimensions. We can also visualize a 3D space and make a 3D plot, but in general it becomes harder to plot, visualize, and interpret high-dimensional vectors in a vector space.
    But such dimensionality reduction also results in information loss, as the distances between the vectors (both Euclidean and cosine) are not necessarily preserved in the process. Hence, for tasks like recommendation systems, or finding the most similar results for an input, we would use all 512 dimensions of the vectors to find the nearest neighbors, and use dimensionality reduction mostly for visualization.

  2. These are the labels present in the part of the dataset I used (as also mentioned in the notebook)

['ApplyLipstick',
 'ApplyEyeMakeup',
 'BalanceBeam',
 'Archery',
 'BenchPress',
 'Basketball',
 'BaseballPitch',
 'BabyCrawling',
 'BasketballDunk',
 'BandMarching']

Just from the visual interpretation of the graph: (as circled in the issue description)

  • Things like ApplyLipstick and ApplyEyeMakeup are pretty much clustered together. (they are pretty similar categories)
  • All BabyCrawling videos are clustered together.
  • Things like Indoor sports with an audience are clustered that include BalanceBeam and some videos from Basketball/BasketballDunk that are indoors.
  • Outdoor sports with a single person like BaseballPitch and Archery are clustered together.
  • All videos from Benchpress are clustered together.
  • All videos from BandMarching are clustered together.

Not all original labels are distinctly separated, but it can still group similar videos together.
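
The earlier point about searching with all 512 dimensions (rather than the reduced ones) can be sketched as a plain cosine nearest-neighbour lookup; random vectors stand in for the real embeddings here:

```python
import numpy as np

def cosine_top_k(query, vectors, k=3):
    # Nearest neighbours by cosine similarity over the full 512-d
    # vectors; no dimensionality reduction is applied for search.
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(300, 512))   # stand-in for stored video embeddings
query = vectors[42] + 0.01 * rng.normal(size=512)  # near-duplicate of video 42
print(cosine_top_k(query, vectors)[0])  # 42
```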

@aatmanvaidya I went ahead and clustered them into 10 labels (using K-Means) and produced a visualization by shading each label differently (updated the notebook). This is how it went:

[Image: t-SNE plot with K-Means cluster labels shaded by colour]

The coordinates of the images are different from before, as random_state wasn't set when running t-SNE
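
For reproducibility, both t-SNE and K-Means accept a random_state; a minimal sketch of the clustering step with it pinned (random data stands in for the real embeddings, and the cluster count matches the 10 dataset labels):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the 300 video embeddings from the dataset.
X = np.random.default_rng(0).normal(size=(300, 512))

# random_state pins the centroid initialisation so reruns match.
kmeans = KMeans(n_clusters=10, random_state=0, n_init=10).fit(X)
print(kmeans.labels_.shape, len(set(kmeans.labels_)))  # (300,) 10
```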

I wanted to add to the conversation you are having about dimensions that it's consistent with how we have done this in the past. For search we use the full multi-dimensional vectors; the dimension reduction is strictly used for visualization. When we did the first iteration, we chose 2D for t-SNE simply because it was easier to render on a 2D canvas on the web. We can try 3D if we have the time, I guess.

Okay, understood the point about dimensionality reduction - if it's strictly for visualization then it's fine, meaning 2D is fine.

@Snehil-Shah I looked at the updated notebook code, things look good to me for now

  • It's great that the majority of the videos from the dataset are being clustered into their respective labels. The K-Means code also looks good.

@Snehil-Shah Not sure if you have made up your mind between Uli and Feluda. Do consider submitting a proposal for the video clustering project in Feluda. I think you'll appreciate the complexity of this one, and we could also use some focused, in-depth exploration of this problem as part of this project.

Also, do consider joining our Slack. As we are nearing the proposal submission deadline, it could be handy for resolving any doubts. https://admin417477.typeform.com/to/nVuNyG?typeform-source=tattle.co.in

@dennyabrain I tried doing that, but it says it requires a Tattle email address or an invitation

[Screenshot: Slack sign-up page requiring a Tattle email address or invitation]

@dennyabrain I am definitely inclined towards Feluda, it feels more challenging and will be a great learning experience. On a side note, I was thinking of submitting all three of my proposals to Tattle's projects, just so there is some flexibility on your end. Is that alright?

@Snehil-Shah that works for me. Hopefully the two issues, audio and video, have a lot of commonality, meaning less work for you :)

Regarding Slack, please share your email and I'll send an invite.

Closing this issue as the DMP program has started.