frostinassiky / gtad

The official implementation of G-TAD: Sub-Graph Localization for Temporal Action Detection

Home Page: https://www.deepgcns.org/app/g-tad

Mismatch between loaded features and snippet indexes.

Phoenix1327 opened this issue · comments

Thanks for releasing the code. I think there may be a bug in how the features are loaded from the h5 file.
In lines 208 and 209 of dataset.py, the features are loaded every 5 frames (self.skip_videoframes=5 for THUMOS):
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L208
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L209
So the loaded sequence starts at frame 0 (idx=0).

But in line 221, the snippet index starts from #start_snippet=3#.
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L221

Working through the calculation, the anchor region for the first timestamp in the sequence becomes [0.5, 5.5]:
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L241
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L242
But when you calculate the start and end regions for the ground-truth box, there seems to be no such shift along the temporal dimension:

https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L137
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L138

Maybe lines 104 and 105 give the correct anchor_xmin and anchor_xmax (in seconds here), but they are not used to calculate the training labels.
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L104
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L105
BSN's code has a #start_idx#. I guess the reason is that the features used by BSN were already sampled at an interval of 5 frames, with the selected frames starting at the 3rd frame.
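To make the offset concrete, here is a minimal stand-alone sketch. It is not the repository code: the names follow the ones quoted above, and num_frames is a made-up video length used only for illustration.

```python
# Hypothetical illustration; values are taken from this thread, except
# num_frames, which is an invented video length.
skip_videoframes = 5   # sampling interval quoted for THUMOS
start_snippet = 3      # offset quoted from line 221
num_frames = 23        # assumption, for illustration only

# Frames whose features are actually loaded: every 5th frame, starting at 0.
loaded_frames = list(range(0, num_frames, skip_videoframes))
num_snippet = len(loaded_frames)

# Indexes assigned to those same features, as described for line 221.
snippet_indexes = [start_snippet + skip_videoframes * i for i in range(num_snippet)]

# Difference between the assigned index and the frame that was really loaded.
offsets = [s - f for s, f in zip(snippet_indexes, loaded_frames)]

print(loaded_frames)     # [0, 5, 10, 15, 20]
print(snippet_indexes)   # [3, 8, 13, 18, 23]
print(offsets)           # [3, 3, 3, 3, 3]
```

Every feature is labelled 3 frames later than the frame it was read from, which is the shift described above.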

commented

Hey @Phoenix1327 Thanks for pointing this out!

When I load the features, I only use the first frame to represent each 5-frame segment.
It is not the optimal solution because, as you mentioned, the 3rd frame, or the average, should be more representative.

From my personal experience, the features are very similar to those of their temporal neighbours. The improvement might be marginal, but it is still worth discussing!

I would like to apply your suggestion and load the third frame. Let's keep this issue open and post the results of the new experiment here.

Sorry, I may not have put my idea across properly.
In fact, I think using the first frame is fine. The sampled frames will be [0, 5, 10, ...].
However, the problem is that the indexes assigned to the sampled frames in line 221 are incorrect: they are [0+3, 5+3, 10+3, ...].
I suggest changing line 221 to:
df_snippet = [skip_videoframes * i for i in range(num_snippet)]

Otherwise, consider how the match scores are calculated for timestamp t (t in [0, 5, 10, ...]) in line 144:
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L144
If I have not miscalculated, the anchor region for timestamp t in line 144 is [t+3-2.5, t+3+2.5]. Matching this region against the ground-truth start/end regions leads to the mismatch. I think the correct region for timestamp t is [t-2.5, t+2.5].
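The arithmetic can be checked with a small sketch. The helper anchor_region is hypothetical (it is not a function in dataset.py); it only assumes that an anchor spans half of skip_videoframes on each side of its index, as in the regions above.

```python
def anchor_region(index, skip_videoframes=5):
    """Return (xmin, xmax) of a region centred on `index`.

    Hypothetical helper for illustration, not part of the repository.
    """
    half = skip_videoframes / 2
    return (index - half, index + half)

start_snippet = 3  # offset quoted from line 221
t = 0              # first sampled frame

shifted = anchor_region(t + start_snippet)  # region implied by the current indexes
aligned = anchor_region(t)                  # region centred on the loaded frame

print(shifted)  # (0.5, 5.5)
print(aligned)  # (-2.5, 2.5)
```

The shifted region reproduces the [0.5, 5.5] found for the first timestamp, while the feature itself was read from frame 0.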

Alternatively, you can load the third frame of each segment by changing lines 208 and 209 to self.flow_val[video_name][start_snippet:-1:self.skip_videoframes,...],
self.rgb_val[video_name][start_snippet:-1:self.skip_videoframes,...]
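As a rough check of what the two slicings select, here is a sketch that uses a plain Python list as a stand-in for the per-frame features. It assumes the h5 dataset slices along its first axis the same way a list does, and num_frames is again a made-up length.

```python
# Stand-in for the per-frame features: element i represents frame i.
# Assumption: the real h5 dataset slices like a list along its first axis.
num_frames = 23        # invented video length, for illustration only
skip_videoframes = 5
start_snippet = 3

frames = list(range(num_frames))

# Current loading as described in this thread: first frame of each segment.
first_of_segment = frames[::skip_videoframes]

# Suggested alternative: start at start_snippet; the stop of -1 excludes the last frame.
from_start_snippet = frames[start_snippet:-1:skip_videoframes]

print(first_of_segment)     # [0, 5, 10, 15, 20]
print(from_start_snippet)   # [3, 8, 13, 18]
```

With this made-up length the second slicing yields one snippet fewer, so num_snippet would have to be derived from the features that are actually loaded.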

commented

Hey @Phoenix1327 ,
Thanks again for your constructive feedback!
I have already updated the code based on your suggestion, and the model performance (mAP at tIoU 0.5) increases from 0.427 to 0.430.
Please feel free to re-open the issue if any mismatch problems remain.