Extracting Video Features
rayush7 opened this issue · comments
Thanks for sharing the code for this amazing work.
Do you provide any script to directly extract ViFi-CLIP features for a given video, which can be used for any other downstream task?
Hi @rayush7,
Thank you for showing interest in our work.
Unfortunately, we do not provide a script for this, but I think you can put one together fairly easily. Please refer to the high-level steps outlined below:
- Firstly, arrange your video(s) in the format outlined in our data preparation section (DATASETS.md).
- Next, you would need to change the forward function of the ViFi-CLIP model (please check vificlip.py) to make it return only the video features. The code will look something like:
import pickle

def forward(self, image):
    tokenized_prompts = self.tokenized_prompts
    logit_scale = self.logit_scale.exp()
    prompts = self.prompt_learner()
    # image holds a batch of videos: [B, T, C, H, W]
    b, t, c, h, w = image.size()
    # Fold the temporal dimension into the batch dimension: [B*T, C, H, W]
    image = image.reshape(-1, c, h, w)
    # Pass all frames through the CLIP visual encoder
    image_features = self.image_encoder(image.type(self.dtype))
    # Restore the batch and temporal dimensions
    image_features = image_features.view(b, t, -1)  # [B, T, 512]
    # Take the mean along the temporal dimension
    image_features = image_features.mean(dim=1, keepdim=False)  # [B, 512], video features are now ready
    video_features = {"features": image_features}
    # Dump the features to disk using pickle etc. (save_path must be defined by you)
    with open(save_path + '/video_features.pickle', 'wb') as handle:
        pickle.dump(video_features, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return image_features
You might need to save the corresponding metadata (video name, path, etc.) for each video along with the dictionary.
- Finally, you can run the inference command to generate the ViFi-CLIP features and later use them for different downstream tasks.
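To make the steps above concrete, here is a minimal, self-contained sketch of the same idea: fold the frames into the batch, encode them, average over time, and pickle the result together with the video names. Note that `DummyFrameEncoder`, `extract_video_features`, `save_features`, and the clip names are hypothetical and only for illustration; the dummy encoder is a stand-in for the real CLIP visual encoder and is not part of the ViFi-CLIP code base.

```python
import os
import pickle
import tempfile

import torch
import torch.nn as nn


class DummyFrameEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP visual encoder (one 512-d vector per frame)."""

    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, frames):
        # frames: [B*T, C, H, W] -> [B*T, out_dim]
        return self.proj(frames.flatten(start_dim=1))


@torch.no_grad()
def extract_video_features(encoder, video):
    """Encode every frame, then average over time: [B, T, C, H, W] -> [B, D]."""
    b, t, c, h, w = video.size()
    frame_features = encoder(video.reshape(-1, c, h, w))  # [B*T, D]
    return frame_features.view(b, t, -1).mean(dim=1)      # [B, D]


def save_features(features, video_names, save_path):
    """Pickle a dict mapping each video name to its CPU feature tensor."""
    record = {name: feat.cpu() for name, feat in zip(video_names, features)}
    out_file = os.path.join(save_path, "video_features.pickle")
    with open(out_file, "wb") as handle:
        pickle.dump(record, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return out_file


torch.manual_seed(0)
video = torch.randn(2, 8, 3, 16, 16)  # 2 clips, 8 frames each, tiny 16x16 frames
encoder = DummyFrameEncoder(in_dim=3 * 16 * 16).eval()
features = extract_video_features(encoder, video)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_features(features, ["clip_a.mp4", "clip_b.mp4"], tmp_dir)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

print(features.shape)  # torch.Size([2, 512])
print(sorted(loaded))  # ['clip_a.mp4', 'clip_b.mp4']
```

Keying the saved dictionary by video name keeps each feature vector attached to its metadata, so a downstream task can look up features per video without relying on batch order.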
I hope that would be helpful.
Please let us know if your query is resolved.
Thank you and kind regards.
Perfect, thank you @muzairkhattak. The image features after taking the mean along the temporal direction (i.e. the video features) are exactly what I was looking for.