rahmanidashti / CPFairRecSys

[Official Codes] CPFair: Personalized Consumer and Producer Fairness Re-ranking for Recommender Systems (SIGIR2022)

Home Page: https://rahmanidashti.github.io/CPFairRecSys/


Questions about the potential unfair comparison in the experiments

BigSorryMaker opened this issue

Hi, I have some questions about the paper's experimental settings and would appreciate your clarification.

In a standard recommendation evaluation, the final recommendation list excludes items the user has already interacted with in the training set. However, in this code those training interactions appear in the lists used for evaluation. For example, in the fairness-unaware mode 'N', which takes no fairness into account, the final recommendation list 'W' selects the k items with the largest scores in 'S', since the objective is simply "maximize(xsum((S[i][j] * W[i][j])". But the scores in 'S' are raw model predictions that are not post-processed in any way (e.g., by setting the predicted scores of training-set interactions to -np.inf). As a result, training interactions receive high prediction scores, participate in the evaluation, and end up in the recommendation list (W=1). Compared with the standard setting, this setup yields worse evaluation results for the fairness-unaware methods, because it places items with zero test relevance ("0"s) in the recommendation list.
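For concreteness, here is a minimal sketch of the standard masking step described above. The names `scores` (standing in for the prediction matrix 'S') and `train` (a binary user-item training interaction matrix) are assumptions for illustration, not identifiers from the repository's code:

```python
import numpy as np

def mask_and_topk(scores, train, k):
    """Exclude training interactions, then take the top-k items per user."""
    masked = scores.astype(float).copy()
    masked[train.astype(bool)] = -np.inf  # already-seen items can never rank
    # column indices of the k highest-scoring unseen items for each user
    return np.argsort(-masked, axis=1)[:, :k]

scores = np.array([[0.9, 0.8, 0.1, 0.4],
                   [0.2, 0.7, 0.6, 0.3]])
train = np.array([[1, 0, 0, 0],   # user 0 already interacted with item 0
                  [0, 1, 0, 0]])  # user 1 already interacted with item 1
topk = mask_and_topk(scores, train, k=2)
# user 0's top item is no longer item 0, despite its 0.9 score
```

Without the `-np.inf` masking, item 0 (score 0.9) would occupy user 0's top slot and count as a miss at evaluation time, which is exactly the concern raised above.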

Although all methods, including CPFair, are affected by this issue, I am concerned that it may lead to unfair comparisons and produce spurious performance improvements for the current re-ranking methods. Specifically, since CPFair pushes the training NDCGs of the different user groups toward each other, it tends to remove some training interactions from the recommendation lists of the user groups whose original training NDCG is high, which can spuriously boost the overall recommendation performance. Given that, under the standard setting, the recommender does not recommend items the user has already interacted with, I wonder whether CPFair can still improve recommendation performance and fairness under that setting.

Another small issue is that the current code does not seem to compute the average training NDCG for each user group, but rather the sum of the NDCGs over the group's users. Because the sum scales with group size while the mean does not, this may cause CPFair to remove more training-set interactions during testing.
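A small sketch of the sum-vs-mean distinction (all names here are hypothetical, not taken from the repository): two groups can have identical average NDCG yet very different summed NDCG whenever their sizes differ, so an optimizer equalizing sums would still shift recommendations between them.

```python
def group_ndcg(ndcg_per_user, groups, reduce="mean"):
    """Aggregate per-user NDCG scores within each user group."""
    out = {}
    for g in set(groups):
        vals = [s for s, grp in zip(ndcg_per_user, groups) if grp == g]
        out[g] = sum(vals) if reduce == "sum" else sum(vals) / len(vals)
    return out

# Three "advantaged" users and two "disadvantaged" users (toy numbers):
ndcg = [0.5, 0.7, 0.6, 0.4, 0.8]
groups = ["adv", "adv", "adv", "disadv", "disadv"]
means = group_ndcg(ndcg, groups, reduce="mean")  # equal: 0.6 vs 0.6
sums = group_ndcg(ndcg, groups, reduce="sum")    # unequal: 1.8 vs 1.2
```

Here the groups are already fair on average, yet their sums differ purely because of group size, so a fairness constraint on sums would penalize the larger group.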

Feel free to point out any misunderstandings I may have. Thanks!