showlab / UniVTG

[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding

Home Page: https://arxiv.org/abs/2307.16715



About timestamp calculation in DatasetMR

simon-zy opened this issue · comments

model_inputs["timestamp"] = ( (torch.arange(0, ctx_l) + self.clip_len / 2) / ctx_l).unsqueeze(1).repeat(1, 2)

Greetings.
Shouldn't 0.5 be used here rather than clip_len / 2?
As I understand it, we need to compute the center timestamp of each clip, and ctx_l is the number of clips in the video. So to get the center of each clip, shouldn't we always use 0.5, since torch.arange(0, ctx_l) is just the list of integers in [0, ctx_l)?
Something like this:
model_inputs["timestamp"] = ( (torch.arange(0, ctx_l) + 0.5) / ctx_l).unsqueeze(1).repeat(1, 2)
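To make the difference concrete, here is a small plain-Python sketch (NumPy/torch-free, with illustrative values for ctx_l and clip_len) comparing the two formulas. Assuming ctx_l equal-length clips tile the video, clip i covers [i/ctx_l, (i+1)/ctx_l] in normalized time, so its true center is (i + 0.5) / ctx_l:

```python
# Hypothetical values: 4 clips, each 2 seconds long.
ctx_l, clip_len = 4, 2.0

# Original formula: adds clip_len / 2 (in seconds) to a unitless index.
original = [(i + clip_len / 2) / ctx_l for i in range(ctx_l)]

# Proposed formula: adds 0.5 (half a clip, in index units).
proposed = [(i + 0.5) / ctx_l for i in range(ctx_l)]

print(original)  # [0.25, 0.5, 0.75, 1.0]   -- shifted whenever clip_len != 1
print(proposed)  # [0.125, 0.375, 0.625, 0.875]  -- true normalized clip centers
```

The two agree only when clip_len == 1; otherwise the original version shifts every center by (clip_len - 1) / (2 * ctx_l).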

@QinghongLin Hi, I'm really sorry to bother you, but do you have any thoughts?

@simon-zy Sorry for the late reply, I am busy with a deadline. I will check this and update you after Nov. 17 :), thank you!

@QinghongLin Sorry, but do you have time to check this out now?

@simon-zy Thanks for the heads up and follow up!
Your concern makes sense, and I think the updated version is correct.
I agree that the ratio should be independent of clip_len.

In my implementation, under the learning scheme, the model learns start/end offsets relative to the initially assigned timestamps, so it learns those offsets from a not-exactly-centered anchor.
The final windows should still be valid, since we do not require the left and right offsets to be equal.
But an ideal implementation would be what you proposed. Thanks for pointing this out :)
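To illustrate why a shifted anchor can still produce a valid window: the model predicts a left and a right offset from each anchor timestamp independently, so unequal offsets can compensate for the shift. A minimal sketch (function and variable names are illustrative, not from the UniVTG codebase):

```python
def decode_window(anchor, left_offset, right_offset):
    """Decode a (start, end) window from an anchor timestamp and
    independently predicted left/right offsets (all in normalized time)."""
    return (anchor - left_offset, anchor + right_offset)

# Ground-truth window [0.25, 0.625] has midpoint 0.4375, but an anchor at
# 0.5 still recovers it exactly, via unequal learned offsets.
print(decode_window(0.5, 0.25, 0.125))  # (0.25, 0.625)
```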