showlab / UniVTG

[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding

Home Page: https://arxiv.org/abs/2307.16715


Questions on UniVTG

nqx12348 opened this issue · comments

Hi, congratulations on your great success! I have two questions about UniVTG:

  1. ActivityNet-Captions is one of the most commonly used datasets in video moment retrieval, but I couldn't find results on this dataset in the paper. Have you tested UniVTG on it?
  2. I tried your online demo and found that the model gives completely different predictions for two identical text inputs. Why is this happening?
    [screenshot attached]

Thanks!

@nqx12348 , thanks for your interest and for asking! Both are valuable questions.

  1. For ActivityNet, one issue is that most baselines use pre-extracted video features, e.g., C3D, while our unified co-training requires all benchmarks to use the same features (e.g., SlowFast + CLIP), so we need to extract ActivityNet features ourselves. While downloading ActivityNet, we found that most RGB video links are invalid and cannot be accessed, so we are unable to align with the previous benchmark setting, i.e., the number of training/testing samples.
    Similar issues occur with the DiDeMo and MAD benchmarks (videos cannot be accessed); thus, we selected Charades / NLQ / TACoS, since we can fully access all of their videos.

  2. Regarding the second question, thank you for the reminder! I have just discovered this problem and am trying to find the cause; I will update later.

@QinghongLin
Regarding the second problem:

It seems that the forward() function in main_gradio.py should call

model.eval()

just before

with torch.no_grad():

(probably around line 82 of main_gradio.py)
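For context, here is a minimal self-contained sketch (a toy model, not the repo's actual code) of why the missing `model.eval()` causes this: in training mode, dropout samples a new random mask on every forward pass, so identical inputs can yield different outputs; calling `model.eval()` before `with torch.no_grad():` disables dropout and makes inference deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the grounding model: any network containing Dropout behaves this way.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 1))
x = torch.ones(1, 32)  # the exact same input, fed twice

# Training mode (the bug): dropout stays active, so two passes give different outputs.
with torch.no_grad():
    print("train mode:", model(x).item(), model(x).item())

# The fix: switch to eval mode before inference, then the outputs are identical.
model.eval()
with torch.no_grad():
    print("eval mode: ", model(x).item(), model(x).item())
```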

Hi @jjihwann,
Sorry for this careless mistake. I have updated the corresponding code in the repo, thank you again!

Based on @jjihwann's suggestion, the issue of different predictions has now been resolved. Thanks.

[screenshot attached]

Closing since the problem is solved; please reopen if you have a new issue.