Yui010206 / SeViLA

[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering

Home Page: https://arxiv.org/abs/2305.06988


What is the meaning of frame_num and answer_num?

aixiaodewugege opened this issue

Thanks for your brilliant work!

I can't find explanations for these two configuration options: frame_num and answer_num. Could you please help me?

Thanks for your interest in our work! Here are explanations for those parameters (see the sketch after this list):

  • model.frame_num: number of selected keyframes

  • datasets.nextqa.vis_processor.train.n_frms: number of candidate frames the keyframes are selected from

  • model.answer_num: number of multiple-choice options (e.g., NeXT-QA has 5 options per question, STAR has 4)
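
To make the relationship between the three settings concrete, here is a minimal sketch written as a plain Python dict. The nesting mirrors the dotted names above; the numeric values are illustrative assumptions, not the repo's actual defaults, so check the YAML configs in the repo for the real ones.

```python
# Minimal sketch (not the repo's config loader): the nesting mirrors the
# dotted parameter names above. Values are illustrative assumptions only.
config = {
    "model": {
        "frame_num": 4,    # keyframes the localizer keeps and hands to the answerer
        "answer_num": 5,   # multiple-choice options per question (5 for NeXT-QA, 4 for STAR)
    },
    "datasets": {
        "nextqa": {
            "vis_processor": {
                # candidate frames sampled from the video; the localizer
                # picks model.frame_num keyframes out of these n_frms
                "train": {"n_frms": 32},
            },
        },
    },
}
```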

Thanks for your reply!

I have tested the web demo extensively, but the zero-shot results on my dataset are not very good.

[screenshot of demo output]

I find the model always outputs option 1. Any idea what the problem might be? Also, I only have one GPU; is there any way to test it other than through the web demo?

We have instructions in this repo for running the Gradio demo locally and for running the evaluation.
SeViLA requires at least 12 GB of GPU memory to load the model and run inference with batch size 1.
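
As a quick sanity check for a single-GPU setup, this plain-PyTorch snippet (not part of SeViLA) reports free and total GPU memory, so you can verify the 12 GB requirement before loading the model:

```python
import torch

# Report free/total memory on the default CUDA device.
# torch.cuda.mem_get_info returns (free_bytes, total_bytes).
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")
else:
    print("No CUDA device available.")
```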

Sorry, I phrased that poorly. I already have it running locally with Gradio. What I meant is: does the model support a predict_answers() function like BLIP-2, so that I can run inference over a whole dataset and inspect the outputs?

Also, could you give me some guidance on using SeViLA without providing answer options? Should I change sevila.generate_demo to sevila.generate or sevila.predict_answers?

Yes, you can check and use the generate() function to test on multiple-choice QA datasets.
For open-ended answer generation, you can feed in only the question and decode the Flan-T5 output; check here. A rough sketch of both modes follows.
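
To illustrate the two modes, here is a minimal, non-authoritative sketch. Only generate(), generate_demo, and predict_answers are named in this thread; the loading helper load_sevila, the video loader load_video, the prompt templates, and the input-dict keys below are hypothetical placeholders, so consult the repo's evaluation code for the actual interface.

```python
import torch

# Hypothetical helpers standing in for however the repo builds the model and
# preprocesses video; these names are placeholders, not the real API.
sevila, vis_processor = load_sevila(device="cuda")     # placeholder loader
frames = vis_processor(load_video("my_clip.mp4"))      # placeholder preprocessing

# Mode 1 (multiple-choice): append the candidate options to the prompt and
# let generate() pick among them, as on multi-choice QA datasets.
mc_prompt = (
    "Question: What is the person doing? "
    "Option 1: cooking. Option 2: reading. Option 3: running. Answer:"
)

# Mode 2 (open-ended): pass only the question and decode the Flan-T5 output
# as free-form text, per the advice above.
open_prompt = "Question: What is the person doing? Answer:"

with torch.no_grad():
    for prompt in (mc_prompt, open_prompt):
        # "video" / "qa_input" are assumed key names for the input dict.
        out = sevila.generate({"video": frames.unsqueeze(0).cuda(),
                               "qa_input": prompt})
        print(out)
```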

I have the same question: when I feed the model samples from the NeXT-QA dataset, I always get option 1 in response.
[screenshot of model output]