TengdaHan / DPC

Video Representation Learning by Dense Predictive Coding. Tengda Han, Weidi Xie, Andrew Zisserman.


Not understanding why you take the last sequences and not the last samples of each sequence

wolhandlerdeb opened this issue · comments

in dpc/model_3d.py
feature_inf = feature_inf_all[:, N-self.pred_step::, :].contiguous()
N is supposed to be the number of sequences; don't we aim to predict the last samples of each sequence?

No, more than the last one.
E.g. if the task is 5pred3, the clip contains 8 steps in total, and this line of code stores the features of the last 3 steps (the prediction targets).
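A minimal sketch of what that slice does, using NumPy arrays in place of PyTorch tensors (the shapes here follow the example numbers in this thread; `feature_inf_all` and `pred_step` mirror the names in dpc/model_3d.py):

```python
import numpy as np

# Illustrative shapes from the thread: B=16 videos, N=8 time steps,
# feature dimension flattened to C=256 for brevity.
B, N, C = 16, 8, 256
pred_step = 3  # "5pred3": 5 context steps predict the last 3 steps

feature_inf_all = np.random.randn(B, N, C)

# The line in question slices the TEMPORAL axis (axis 1), not the batch axis.
# Note: in the repo it is written `N - self.pred_step::`, where the trailing
# `::` is equivalent to a plain `:` (slice to the end, default step of 1).
feature_inf = feature_inf_all[:, N - pred_step:, :]

print(feature_inf.shape)  # (16, 3, 256)
```

So each of the 16 videos keeps its own last 3 time steps; no videos in the batch are dropped.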

Yes, I did understand that.
I meant: N is the number of sequences and SL is the length of each sequence, so why are you taking the features of the last self.pred_step sequences rather than the features of the last steps of each sequence?
Did I get it right?

If I understand correctly, this probably helps.
The pipeline is:
input video: [B, N, 3, SL, 128, 128], e.g. [16, 8, 3, 5, 128, 128]
extract feature z for all B*N samples, getting [B, N, C, H, W], e.g. [16, 8, 256, 4, 4]
Up to now, we have 16 videos, each with feature maps for 8 time steps.
Then we do the 5pred3 or 4pred4 task based on these features.
So we use the first part of them to predict the last part of them (along the temporal axis).
Does this clarify your question?
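The shape bookkeeping above can be traced end to end in a few lines. This is a shape-only sketch (the zero arrays stand in for real frames and for the 3D-CNN encoder output; variable names are illustrative, not taken from the repo):

```python
import numpy as np

B, N, SL = 16, 8, 5     # batch, blocks per video, frames per block
C, H, W = 256, 4, 4     # feature channels and spatial size after the encoder
pred_step = 3           # the 5pred3 task

video = np.zeros((B, N, 3, SL, 128, 128))        # input clips

# The encoder processes each of the B*N blocks independently, so the batch
# and block axes are folded together before encoding...
blocks = video.reshape(B * N, 3, SL, 128, 128)
encoded = np.zeros((B * N, C, H, W))             # stand-in for encoder output

# ...and unfolded afterwards, recovering the temporal axis.
features = encoded.reshape(B, N, C, H, W)

# Split along the temporal axis (axis 1): the first N - pred_step steps are
# context for the aggregator, the last pred_step steps are the targets.
context = features[:, :N - pred_step]            # (16, 5, 256, 4, 4)
targets = features[:, N - pred_step:]            # (16, 3, 256, 4, 4)

print(context.shape, targets.shape)
```

This is why the slice runs over axis 1: every video in the batch contributes its own 5 context steps and 3 target steps.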

I think I understand. Thanks a lot