TengdaHan / DPC

Video Representation Learning by Dense Predictive Coding. Tengda Han, Weidi Xie, Andrew Zisserman.

The accuracy of SOTA on UCF101 is more than 98%, why is DPC worse?

PGogo opened this issue

commented

As far as I know, the state of the art on UCF101 is more than 98%, for example I3D. Also, two-stream got more than 88%. But DPC got about 65% as reported in the paper (even though it is finetuned with supervised learning like these methods). What am I missing in the paper?

You could read some papers about 'self-supervised learning' on either images or videos, and also the references in our paper like OPN (Lee et al.) and 3D-ST-Puzzle (Kim et al.).
The 98% and 88% results you mentioned are finetuned from a pretrained network: supervised pretraining on a larger dataset like Kinetics (for I3D) or ImageNet (for two-stream), which requires expensive annotation.
Self-supervised learning doesn't require labels to learn the representation.
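
For illustration, here is a minimal sketch of why the pretraining stage needs no annotation (this is not the actual DPC code; the linear layers, feature sizes, and random tensors are made-up stand-ins for the real video encoder and clips): the training target is built from the data itself, e.g. by predicting the embedding of a future clip from past clips and scoring it contrastively.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(128, 64)   # stand-in for the 3D-CNN clip encoder
predictor = nn.Linear(64, 64)  # predicts the future embedding from the past

past = torch.randn(8, 128)     # a batch of "past" clip features, no labels
future = torch.randn(8, 128)   # the corresponding "future" clips, no labels

pred = predictor(encoder(past))            # predicted future embeddings
target = encoder(future)                   # actual future embeddings
logits = pred @ target.t()                 # similarity between every pair
positives = torch.arange(8)                # the matching index is the positive
loss = F.cross_entropy(logits, positives)  # the "label" comes from the data
loss.backward()
```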

commented

Thanks for your prompt reply! But I think it may still require labels in the downstream classification task, since it has to output the highest score to match the label. Does this part require labels like supervised learning? If yes, then it is the same as supervised learning, right? Then the reduced performance may only be due to the pretrained model? I'm curious :)

Evaluating feature quality by finetuning on an action classification task (which requires labels) on smaller datasets is the conventional evaluation method for videos. Yes, the downstream task performance reflects the quality of the pretrained feature.
When comparing the self-supervised feature against the fully-supervised feature, the performance is not necessarily 'reduced'. If you check Table 4 of the two-stream paper (Simonyan and Zisserman, 2014), the spatial stream's result on UCF101 with ImageNet pretraining is 73%, but our self-supervised pretraining gets 76%. Self-supervised learning is very promising because you have unlimited data available from the internet.
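
For concreteness, here is a minimal sketch of that finetuning protocol (illustrative only: the layer shapes, the 128-dim clip features, and the dummy batch are made-up stand-ins for a real pretrained backbone and a real labeled UCF101 loader):

```python
import torch
import torch.nn as nn

# pretend `encoder` was pretrained without labels, then reused here
encoder = nn.Linear(128, 64)
classifier = nn.Linear(64, 101)      # UCF101 has 101 action classes
model = nn.Sequential(encoder, classifier)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# dummy stand-in for one labeled UCF101 batch
clips = torch.randn(8, 128)
labels = torch.randint(0, 101, (8,))

optimizer.zero_grad()
loss = criterion(model(clips), labels)  # labels are only needed at this stage
loss.backward()
optimizer.step()
```

The point is that labels enter only in this downstream evaluation stage; the representation itself was learned without them.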