PreferredAI / cornac

A Comparative Framework for Multimodal Recommender Systems

Home Page: https://cornac.preferred.ai

[ASK] Why are the results from models so much lower than actual implementations of the model?

vedantc6 opened this issue · comments

Description

For example, NCF results come out as low as 0.1377 (NDCG@10), 0.1215 (Precision@10), and 0.1033 (Recall@10), whereas the original paper reports NDCG values around 0.40. Even other NCF implementations report evaluation numbers near 0.40.
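
For context, here is a minimal sketch of how such ranking numbers are typically produced with Cornac. The MovieLens 100K loader, the 80/20 split, and the NeuMF hyperparameters below are assumptions for illustration, not necessarily the setup behind the figures above:

```python
import cornac
from cornac.datasets import movielens
from cornac.eval_methods import RatioSplit
from cornac.metrics import NDCG, Precision, Recall

# Load MovieLens 100K feedback (assumed dataset; the original setup may differ)
feedback = movielens.load_feedback(variant="100K")

# 80/20 random split; exclude_unknowns drops test users/items unseen during training
rs = RatioSplit(data=feedback, test_size=0.2, exclude_unknowns=True, seed=123)

# NeuMF with illustrative hyperparameters (not necessarily those used above)
neumf = cornac.models.NeuMF(num_factors=8, layers=[32, 16, 8], num_epochs=10, seed=123)

cornac.Experiment(
    eval_method=rs,
    models=[neumf],
    metrics=[NDCG(k=10), Precision(k=10), Recall(k=10)],
).run()
```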

Other Comments

Hi,

Thanks for asking.

In the NCF paper, item ranking for every user is performed over 100 items only, i.e., 1 held-out test item and 99 randomly sampled negative items; please refer to the evaluation protocols paragraph in the NCF paper. In Cornac, item ranking for a given user is performed over all items she has not interacted with, which is a more realistic setting, since in practice we do not know which items are positive or negative.
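
To make the protocol difference concrete, here is a rough, self-contained simulation; the catalogue size of 2,000 items and the held-out item's rank of 40 are made-up illustrative values, not numbers from this thread. The same model that misses the top 10 under all-items ranking looks strong under the 99-sampled-negatives protocol:

```python
import math
import random

random.seed(0)

NUM_ITEMS = 2000  # hypothetical catalogue size (illustrative only)
TRUE_RANK = 40    # hypothetical rank of the held-out item among all unseen items

def ndcg_at_10(rank):
    """NDCG@10 with a single relevant item: 1/log2(rank + 1) if it makes the top 10, else 0."""
    return 1.0 / math.log2(rank + 1) if rank <= 10 else 0.0

# Cornac-style protocol: rank against every non-interacted item.
print("all-items NDCG@10:", ndcg_at_10(TRUE_RANK))  # 0.0, rank 40 misses the top 10

# NCF-paper protocol: rank against 99 randomly sampled negatives.
# Each of the 39 items the model scores above the test item is sampled with
# probability ~99 / (NUM_ITEMS - 1), so most of them vanish from the candidate list.
results = []
for _ in range(10000):
    sampled_stronger = sum(random.random() < 99 / (NUM_ITEMS - 1) for _ in range(TRUE_RANK - 1))
    results.append(ndcg_at_10(sampled_stronger + 1))
print("99-negatives NDCG@10 (average):", sum(results) / len(results))
```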

Thank you for the prompt response. Much appreciated.

Additionally, how have you calculated RMSE and MAE for NCF? The original paper does not report them, because NCF is a ranking model, not a rating prediction model.

Technically, RMSE and MAE can be computed for NCF, since the model predicts a score for every (user, item) pair. However, as you mentioned, one should not rely on these measures for evaluating NCF, which is a ranking model.
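
As an illustration of that point, here is a minimal sketch (same assumed data and hyperparameters as the earlier sketch) where RMSE and MAE are simply listed next to a ranking metric. Cornac reports them because NeuMF produces a score for every (user, item) pair, even though they should not be used to judge the model:

```python
import cornac
from cornac.datasets import movielens
from cornac.eval_methods import RatioSplit
from cornac.metrics import MAE, NDCG, RMSE

# Same assumed data and model as in the earlier sketch.
feedback = movielens.load_feedback(variant="100K")
rs = RatioSplit(data=feedback, test_size=0.2, exclude_unknowns=True, seed=123)
neumf = cornac.models.NeuMF(num_factors=8, layers=[32, 16, 8], num_epochs=10, seed=123)

cornac.Experiment(
    eval_method=rs,
    models=[neumf],
    # RMSE/MAE are computed from the predicted scores, but only ranking metrics
    # such as NDCG should be used to assess a ranking model like NCF/NeuMF.
    metrics=[RMSE(), MAE(), NDCG(k=10)],
).run()
```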