epic-kitchens / epic-kitchens-100-annotations

:plate_with_cutlery: Annotations for the public release of the EPIC-KITCHENS-100 dataset

Many noun and verb classes are not covered in the training and validation sets

YuanGongND opened this issue

Hi there,

Thanks so much for building this great dataset!

I have a question regarding the annotations.

Looking at EPIC_100_verb_classes.csv and EPIC_100_validation.csv, I found that quite a few (>10) verb classes do not appear in the validation set, and the same happens for the noun classes and the training set. Is this expected? How did you report accuracy for the baselines when some classes are absent from the validation set? Also, for the action recognition task, is it correct to use noun_class (the numeric ID of the first noun's class) as the target and ignore all other nouns, i.e. treat it as a single-label classification problem?
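
For reference, a minimal sketch of this check (assuming pandas and the `id` / `verb_class` column names in the released CSVs) would be something like:

```python
import pandas as pd

# Sketch: which verb classes never appear in the validation split?
# Assumes `id` is the class ID column in EPIC_100_verb_classes.csv and
# `verb_class` is the per-segment class column in EPIC_100_validation.csv.
verb_classes = pd.read_csv("EPIC_100_verb_classes.csv")
validation = pd.read_csv("EPIC_100_validation.csv")

all_verb_classes = set(verb_classes["id"])
val_verb_classes = set(validation["verb_class"])

missing = all_verb_classes - val_verb_classes
print(f"{len(missing)} verb classes never appear in the validation set: {sorted(missing)}")
```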

Thanks very much!

Best,
Yuan

Thanks for your questions.

As specified in the paper, we create our train/val/test splits by separating full videos, so all occurrences of a class can indeed end up in a single split; we even have a few zero-shot cases. So yes, this is expected and described in the paper.

You can check all the evaluation metrics, and how these cases are handled when the metrics are calculated, in our released evaluation code here: https://github.com/epic-kitchens/C1-Action-Recognition.

For the noun_class, we indeed take the main noun's class (not necessarily the first noun, and not necessarily a single word) and ignore the other nouns, as these are inconsistent. Let me explain with an example. If the narration is "put chopping board down on table", we first parse the sentence, which gives: (put -> down) -> (board -> chopping) -> (on -> table) [see spaCy for details].
We then take the noun associated with the verb, in this case "chopping board" [which is more than one word]. These noun phrases are grouped into classes. We do not take the second noun ("on table") because it is not consistently annotated: for some narrations we might only have "put down chopping board". Since the additional nouns are not always labelled, we cannot consistently use them for evaluation of the classification task.
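
For illustration, here is a minimal sketch of such a dependency parse with spaCy (the model name and exact labels below are assumptions, not necessarily the pipeline we used):

```python
import spacy

# Minimal illustration of the dependency parse described above; the model
# ("en_core_web_sm") and the resulting labels depend on the spaCy version
# and are not necessarily what was used to build the annotations.
nlp = spacy.load("en_core_web_sm")
doc = nlp("put chopping board down on table")

for token in doc:
    print(f"{token.text:10s} dep={token.dep_:10s} head={token.head.text}")

# The noun attached to the verb as its object ("board", with the compound
# "chopping") gives the main noun phrase "chopping board"; the prepositional
# phrase "on table" is the second noun and is ignored for classification.
```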

Additional notes:
- All nouns are used in the retrieval task (see the sketch below for where these live in the annotation files).
- Check our latest VISOR annotations for all action-relevant objects; often there are 4-5 action-relevant objects per action. These are, rightly, not used in the classification task, but they are annotated for the VISOR subset should you need them.
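
As a rough sketch of where the additional nouns live (assuming the stringified-list columns `all_nouns` and `all_noun_classes` in EPIC_100_train.csv):

```python
import ast
import pandas as pd

# Sketch: reading every annotated noun of a segment, not just the main one.
# Assumes the `all_nouns` / `all_noun_classes` columns in EPIC_100_train.csv
# are stored as stringified Python lists.
train = pd.read_csv("EPIC_100_train.csv")

row = train.iloc[0]
main_noun = row["noun"]                               # main noun, used for classification
all_nouns = ast.literal_eval(row["all_nouns"])        # every noun, used for retrieval
all_noun_classes = ast.literal_eval(row["all_noun_classes"])

print(main_noun, all_nouns, all_noun_classes)
```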