Extracting Video Features

rayush7 opened this issue a year ago

Thanks for sharing the code for this amazing work.

Do you provide a script to directly extract ViFi-CLIP features for a given video, so that they can be used for other downstream tasks?

Muhammad Uzair Khattak · Answer 1 · Sun May 28 2023 20:24:49 GMT+0800 (China Standard Time)

Hi @rayush7,

Thank you for showing interest in our work.

Unfortunately, we do not provide a script for this, but you should be able to write one easily. The high-level steps are outlined below:

First, arrange your video(s) in the format described in our data preparation section (DATASETS.md).
Next, modify the forward function of the ViFi-CLIP model (see vificlip.py) so that it returns only the video features. The code will look something like this:

    import os
    import pickle

    def forward(self, image):
        # The text-side inputs (prompt_learner, tokenized_prompts, logit_scale)
        # are not needed to compute video features, so they are skipped here.

        # image is a batch of videos shaped [B, T, C, H, W]
        b, t, c, h, w = image.size()
        # Merge the batch and temporal dimensions so each frame is encoded separately
        image = image.reshape(-1, c, h, w)
        # Pass the frames through the CLIP visual encoder
        image_features = self.image_encoder(image.type(self.dtype))
        # Restore the batch and temporal dimensions
        image_features = image_features.view(b, t, -1)  # [B, T, 512]
        # Take the mean along the temporal dimension: one feature vector per video
        image_features = image_features.mean(dim=1, keepdim=False)  # [B, 512]

        # Dump the features to disk with pickle (moved to the CPU first).
        # save_path is an output directory that you need to define yourself.
        # Note: the file is overwritten on every call, so use a unique name per batch/video.
        video_features = {"features": image_features.detach().cpu()}
        with open(os.path.join(save_path, 'video_features.pickle'), 'wb') as handle:
            pickle.dump(video_features, handle, protocol=pickle.HIGHEST_PROTOCOL)

        return image_features
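To make the reshape-encode-average steps easier to check in isolation, here is a minimal standalone sketch. `DummyImageEncoder` and `encode_videos` are hypothetical names; the dummy encoder is just a linear layer standing in for the CLIP visual encoder (which in ViFi-CLIP produces 512-d frame features). Only the tensor shape handling mirrors the snippet above.

```python
import torch
import torch.nn as nn


class DummyImageEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP visual encoder: maps each frame to a `dim`-d vector."""

    def __init__(self, c=3, h=32, w=32, dim=512):
        super().__init__()
        self.proj = nn.Linear(c * h * w, dim)

    def forward(self, frames):
        # frames: [N, C, H, W] -> [N, dim]
        return self.proj(frames.flatten(start_dim=1))


def encode_videos(image_encoder, videos):
    """Turn a batch of videos [B, T, C, H, W] into one feature vector per video [B, D]."""
    b, t, c, h, w = videos.size()
    frames = videos.reshape(-1, c, h, w)            # [B*T, C, H, W]
    frame_features = image_encoder(frames)          # [B*T, D]
    frame_features = frame_features.view(b, t, -1)  # [B, T, D]
    return frame_features.mean(dim=1)               # [B, D]: temporal average


torch.manual_seed(0)
encoder = DummyImageEncoder()
videos = torch.randn(2, 8, 3, 32, 32)  # 2 videos, 8 frames each
with torch.no_grad():
    video_features = encode_videos(encoder, videos)
print(video_features.shape)  # torch.Size([2, 512])
```

Because the frames are flattened in batch-major order, `view(b, t, -1)` recovers exactly the frames of each video. So the result for the first video equals encoding its 8 frames on their own and averaging them.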

You may also want to store the corresponding metadata (e.g. video name or path) for each video in the same dictionary.
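One way to keep the metadata next to the features is to key the dictionary by video. The sketch below is a hedged example using only the standard library. `save_video_features` and `load_video_features` are hypothetical helpers, and the features are assumed to already be plain lists (for example, from `tensor.cpu().tolist()`).

```python
import os
import pickle
import tempfile


def save_video_features(paths, features, save_path):
    """Pickle one record per video, keeping its metadata next to its feature vector.

    paths:    list of video file paths (hypothetical input, same order as features)
    features: list of feature vectors, e.g. obtained with tensor.cpu().tolist()
    """
    records = {
        os.path.basename(p): {"path": p, "features": f}
        for p, f in zip(paths, features)
    }
    out_file = os.path.join(save_path, "video_features.pickle")
    with open(out_file, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return out_file


def load_video_features(out_file):
    with open(out_file, "rb") as handle:
        return pickle.load(handle)


# Usage with made-up paths and 3-d vectors (real ViFi-CLIP features would be 512-d).
with tempfile.TemporaryDirectory() as tmp_dir:
    out = save_video_features(
        ["videos/clip_a.mp4", "videos/clip_b.mp4"],
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        tmp_dir,
    )
    loaded = load_video_features(out)
print(sorted(loaded))  # ['clip_a.mp4', 'clip_b.mp4']
```

Keying by file name assumes the names are unique. If videos in different folders share a name, key by the full path instead.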

Finally, run the inference command to generate the ViFi-CLIP features, which you can then use for different downstream tasks.
I hope this helps.
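If you would rather not write a file inside `forward`, an alternative is to let the modified model just return features and collect them across all batches in your own loop. The sketch below assumes a loader that yields `(videos, video_paths)` pairs. `ToyVideoModel` and `extract_all_features` are hypothetical names, and the toy model only averages pixels so the example runs without CLIP weights.

```python
import torch


class ToyVideoModel(torch.nn.Module):
    """Hypothetical stand-in for the modified ViFi-CLIP model.

    It averages pixels per frame and then averages over frames, so it returns
    [B, C] features instead of real [B, 512] CLIP features.
    """

    def forward(self, videos):
        return videos.mean(dim=(3, 4)).mean(dim=1)


@torch.no_grad()
def extract_all_features(model, loader):
    """Collect per-video features over all batches, then return them together.

    loader is assumed to yield (videos, video_paths) pairs, with videos shaped
    [B, T, C, H, W]; adapt the unpacking to your own data loader.
    """
    model.eval()
    all_paths, all_features = [], []
    for videos, paths in loader:
        all_features.append(model(videos).cpu())
        all_paths.extend(paths)
    return all_paths, torch.cat(all_features, dim=0)


# Usage with two fake batches (3 + 2 videos, 4 frames of 3x16x16 each).
loader = [
    (torch.randn(3, 4, 3, 16, 16), ["v0.mp4", "v1.mp4", "v2.mp4"]),
    (torch.randn(2, 4, 3, 16, 16), ["v3.mp4", "v4.mp4"]),
]
paths, features = extract_all_features(ToyVideoModel(), loader)
print(len(paths), tuple(features.shape))  # 5 (5, 3)
```

Collecting everything first and saving once at the end avoids overwriting the same pickle file on every forward call.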

Please let us know if your query is resolved.
Thank you and kind regards.

Ayush Rai · Answer 2 · Sun May 28 2023 22:05:22 GMT+0800 (China Standard Time)

Perfect, thank you @muzairkhattak. The image features averaged along the temporal dimension (i.e., the video features) are exactly what I was looking for.