DirtyHarryLYL / HAKE-Action-Torch

HAKE-Action in PyTorch

How to determine the proper NIS threshold for each HOI category?

wanna-fly opened this issue

Hi, thanks for your wonderful work!
The NIS strategy has proven very effective for the HOI detection task. As shown in the code, both TIN and IDN set different NIS thresholds for different HOI categories. Could you please tell me how you search for them?
Thanks a lot!

A common way is to randomly select some samples from the training set for each HOI to construct a validation set, then run a grid search to find an optimal threshold for each HOI. Note that different models may behave differently under the same thresholds, due to their different prediction biases.
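
A minimal sketch of such a per-HOI grid search, assuming per-pair interactiveness scores and binary labels are available on the held-out split (the function name, the F1 criterion used as a stand-in for whatever metric you actually validate on, and the candidate range are all illustrative, not the repo's code):

```python
import numpy as np

def search_nis_thresholds(scores, labels, hoi_ids, num_hoi=600,
                          candidates=np.linspace(0.0, 0.9, 19)):
    """Grid-search one NIS threshold per HOI category on a held-out split.

    scores:  (N,) interactiveness score of each human-object pair
    labels:  (N,) 1 if the pair is a true interactive pair, else 0
    hoi_ids: (N,) HOI category index of each pair
    """
    thresholds = np.zeros(num_hoi)
    for c in range(num_hoi):
        mask = hoi_ids == c
        if not mask.any():           # no validation pairs for this HOI
            continue
        s, y = scores[mask], labels[mask]
        best_score, best_t = -1.0, 0.0
        for t in candidates:
            keep = s >= t            # pairs surviving NIS at threshold t
            tp = np.sum(keep & (y == 1))
            fp = np.sum(keep & (y == 0))
            fn = np.sum(~keep & (y == 1))
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best_score:
                best_score, best_t = f1, t
        thresholds[c] = best_t
    return thresholds
```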

Please be careful with rare HOIs: NIS may delete all samples in an image when every pair falls below the threshold. Rare HOIs that appear in only one or a few images are sometimes better excluded from NIS.
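
One way to guard against both issues at inference time, sketched below under the assumption that rare HOI categories are listed explicitly; the fallback of keeping the single most confident pair is one possible mitigation, not necessarily what the released code does:

```python
import numpy as np

def apply_nis(pair_scores, thresholds, hoi_ids, rare_hoi_ids=()):
    """Suppress non-interactive pairs in one image without emptying it.

    pair_scores:  (N,) interactiveness scores of the pairs in this image
    thresholds:   (num_hoi,) per-HOI thresholds from the grid search above
    hoi_ids:      (N,) HOI category of each pair
    rare_hoi_ids: HOI categories with too few images to trust NIS
    """
    keep = np.array([
        s >= thresholds[c] or c in rare_hoi_ids   # rare HOIs bypass NIS
        for s, c in zip(pair_scores, hoi_ids)
    ])
    if not keep.any():                            # NIS would drop every pair
        keep[np.argmax(pair_scores)] = True       # keep the most confident one
    return keep
```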

Got it. Thanks for your answer ^_^.
By the way, in the inference stage of IDN, it seems that the model output is divided by a factor (fac_i, fac_a and fac_d) before mAP evaluation. What is the meaning of these factors, and why are they different?

We combine three streams into one final inference result.
The AE stream outputs logits and uses a sigmoid function to project them into probabilities, while the other two streams output distances and use an exponential function to project them into probabilities.
This makes the combination tricky, given the diverse characteristics of the three streams.
Besides, the different streams also learn different biases, which also need to be mitigated.

To solve this, we take inspiration from LIS, proposed in TIN. We impose an LIS-like function on the outputs of the three streams before projecting them into probability space. In detail, we divide the output logits/distances by a per-class factor. By doing this, we expect the transformed probabilities of the different streams to be coherent and the biases of the different streams to be mitigated.
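
A minimal sketch of this rescale-then-project step; the mapping of fac_a/fac_i/fac_d to specific streams, the sign inside the exponential, and the multiplicative fusion are assumptions for illustration only:

```python
import numpy as np

def fuse_streams(logit_ae, dist_id, dist_d, fac_a, fac_i, fac_d):
    """Rescale each stream's output per class, project to probabilities, fuse.

    logit_ae: (num_hoi,) logits of the AE stream
    dist_id:  (num_hoi,) distances of one decoding stream
    dist_d:   (num_hoi,) distances of the other decoding stream
    fac_*:    (num_hoi,) per-class factors found on the validation split
    """
    # LIS-like rescaling: divide each class's output by its factor
    p_ae = 1.0 / (1.0 + np.exp(-logit_ae / fac_a))   # sigmoid on logits
    p_id = np.exp(-dist_id / fac_i)                   # exponential on distances
    p_d  = np.exp(-dist_d / fac_d)
    return p_ae * p_id * p_d                          # fused HOI probability
```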

We obtained the factors by collecting about 6000 images from the training set as a validation set and providing 20 candidate factors per class per stream. The goal is to select the factors that produce the best performance on this validation set. By the way, we found that smaller factors tend to be assigned to streams that perform relatively better on a class.
The factors can also be made learnable; however, the performance was not as good as choosing from the given candidates, which we attribute to the over-detailed nature of the learning process.
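
A sketch of this candidate selection for one stream, treating classes independently so the search costs 20 evaluations per class rather than a joint search; eval_ap_fn is a hypothetical callback, and the candidate range is an assumption:

```python
import numpy as np

def search_factors(eval_ap_fn, num_hoi=600,
                   candidates=np.linspace(0.5, 10.0, 20)):
    """Pick, for each class, the factor (out of 20 candidates) with the best AP.

    eval_ap_fn(c, f) is assumed to return the AP of HOI class c on the
    ~6000-image validation split when this stream's factor for c is f.
    """
    factors = np.zeros(num_hoi)
    for c in range(num_hoi):
        aps = [eval_ap_fn(c, f) for f in candidates]
        factors[c] = candidates[int(np.argmax(aps))]
    return factors
```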