muzairkhattak / ViFi-CLIP

[CVPR 2023] Official repository of paper titled "Fine-tuned CLIP models are efficient video learners".

Home Page: https://muzairkhattak.github.io/ViFi-CLIP/

Extracting Video Features

rayush7 opened this issue

Hi @muzairkhattak

Thanks for sharing the code for this amazing work.

Do you provide any script to directly extract ViFi-CLIP features for a given video, which can be used for any other downstream task?

Hi @rayush7,

Thank you for showing interest in our work.

Unfortunately, we do not provide a script for this, but it should be straightforward to write one. Please refer to the high-level steps outlined below:

  1. Firstly, arrange your video(s) in the format outlined in our data preparation section (DATASETS.md).

  2. Next, change the forward function of the ViFi-CLIP model (see vificlip.py) so that it returns only the video features. The code would look something like:

    # Requires `import pickle` at the top of vificlip.py
    def forward(self, image):
        # Text-side quantities from the original forward pass (unused for feature extraction)
        tokenized_prompts = self.tokenized_prompts
        logit_scale = self.logit_scale.exp()
        prompts = self.prompt_learner()

        # Unpack the video tensor: [B, T, C, H, W]
        b, t, c, h, w = image.size()
        # Fold the temporal dimension into the batch dimension: [B*T, C, H, W]
        image = image.reshape(-1, c, h, w)
        # Pass all frames through the CLIP visual encoder
        image_features = self.image_encoder(image.type(self.dtype))
        # Restore the batch and temporal dimensions
        image_features = image_features.view(b, t, -1)  # [B, T, 512]
        # Average over the temporal dimension: one feature vector per video
        image_features = image_features.mean(dim=1, keepdim=False)  # [B, 512]

        # Detach and move to CPU so the pickle does not depend on a GPU
        video_features = {"features": image_features.detach().cpu()}
        # Dump the features to disk using pickle etc. `save_path` must be defined by you;
        # use a per-batch file name if you do not want each call to overwrite the file.
        with open(save_path + '/video_features.pickle', 'wb') as handle:
            pickle.dump(video_features, handle, protocol=pickle.HIGHEST_PROTOCOL)

        return image_features
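To see the reshape and temporal averaging in isolation, here is a minimal, self-contained sketch. It assumes PyTorch is installed, and `DummyFrameEncoder` is a hypothetical stand-in for the CLIP visual encoder (it is not part of the ViFi-CLIP code); only the tensor shapes are meant to match the snippet above.

```python
import torch
import torch.nn as nn


class DummyFrameEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP visual encoder: [N, C, H, W] -> [N, D]."""

    def __init__(self, channels=3, dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(channels, dim)

    def forward(self, frames):
        pooled = self.pool(frames).flatten(1)  # [N, C]
        return self.proj(pooled)               # [N, D]


def video_features(video, encoder):
    """Encode every frame, then average over time: [B, T, C, H, W] -> [B, D]."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)     # fold time into the batch axis
    with torch.no_grad():
        frame_feats = encoder(frames)          # [B*T, D]
    frame_feats = frame_feats.view(b, t, -1)   # [B, T, D]
    return frame_feats.mean(dim=1)             # [B, D]


encoder = DummyFrameEncoder().eval()
clip_batch = torch.randn(2, 8, 3, 64, 64)      # 2 videos, 8 frames each
feats = video_features(clip_batch, encoder)
print(feats.shape)  # torch.Size([2, 512])
```

Each row of `feats` is the mean of the per-frame embeddings of one video, which is the quantity the modified forward function writes to disk.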

You might need to save the corresponding metadata (video name or path, etc.) for each video along with the dictionary.
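One possible way to keep metadata next to the features is to key the pickled dictionary by video path. The sketch below uses only the standard library; the helper names (`save_video_features`, `load_video_features`), the paths, the labels, and the three-element vectors are all made-up placeholders. In practice the feature values would come from the model, for example converted with `tensor.tolist()`.

```python
import os
import pickle
import tempfile


def save_video_features(features_by_video, out_file):
    """Pickle a mapping {video_path: {"label": ..., "features": ...}}."""
    with open(out_file, "wb") as handle:
        pickle.dump(features_by_video, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_video_features(in_file):
    """Load the mapping written by save_video_features."""
    with open(in_file, "rb") as handle:
        return pickle.load(handle)


# Placeholder records: paths, labels and vectors are illustrative only.
records = {
    "videos/clip_0001.mp4": {"label": "archery", "features": [0.12, -0.53, 0.98]},
    "videos/clip_0002.mp4": {"label": "bowling", "features": [0.40, 0.07, -0.21]},
}

out_file = os.path.join(tempfile.mkdtemp(), "video_features.pickle")
save_video_features(records, out_file)
restored = load_video_features(out_file)
print(sorted(restored))  # the video paths stored with the features
```

Keying by path means a downstream script can look up the features of a specific video without relying on the order in which batches were processed.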

  3. Finally, you can run the inference command to generate the ViFi-CLIP features and later use them for different downstream tasks.

I hope this helps.
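As an illustration of one downstream use, the sketch below does nearest-neighbour retrieval over saved video features with cosine similarity. It is only an example of what "downstream" could mean here, not something the repository provides; the function names and the tiny three-dimensional vectors are hypothetical placeholders for real 512-dimensional features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_name, features):
    """Return the name of the video whose features are closest to the query video."""
    query = features[query_name]
    scores = {
        name: cosine_similarity(query, vec)
        for name, vec in features.items()
        if name != query_name
    }
    return max(scores, key=scores.get)


# Placeholder features; real ones would be loaded from the pickle file.
features = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}
print(most_similar("clip_a", features))  # clip_b
```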

Please let us know if your query is resolved.
Thank you and kind regards.

Perfect, thank you @muzairkhattak. The image features after taking the mean along the temporal direction (the video features) are exactly what I was looking for.