AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Home Page: https://amenra.github.io/ranx

PSP@k: Propensity-scored precision at k

celsofranssa opened this issue

I want to implement propensity-scored precision at k (PSP@k), as defined below:

$PSP@k = \frac{1}{k} \sum_{i=1}^{k} \frac{y_i}{p_i}$

where $p_i$ is the propensity of $y_i$ and $1 \leq i \leq k$.
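
For concreteness, here is a rough NumPy sketch of this formula (the function name and inputs are just illustrative, they are not ranx API):

import numpy as np

def psp_at_k(rel, props, k):
    # rel:   0/1 relevance of the retrieved items, in ranked order
    # props: propensity p_i of each retrieved item, same order
    rel = np.asarray(rel, dtype=float)[:k]
    props = np.asarray(props, dtype=float)[:k]
    # each relevant item contributes 1 / p_i instead of 1
    return np.sum(rel / props) / k

# example: three of the top-5 items are relevant; tail items have low propensity
print(psp_at_k([1, 0, 1, 1, 0], [0.9, 0.8, 0.2, 0.1, 0.5], k=5))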

Therefore, how could I integrate this metric into ranx?

References:

[1] Zhang, J., Chang, W.-C., Yu, H.-F., and Dhillon, I., 2021. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34, pp. 7267-7280.

Hi, sorry for the delay.

Is this metric similar to reciprocal rank, except that you average over the top k positions instead of considering only the first relevant result retrieved?
Could you please provide an example?

Below is an example from pyxclib:

def psprecision(X, true_labels, inv_psp, k=5, sorted=False, use_cython=False):
    """
    Compute propensity scored precision@k for 1-k

    Arguments:
    ----------
    X: csr_matrix, np.ndarray or dict
        * csr_matrix: csr_matrix with nnz at relevant places
        * np.ndarray (float): scores for each label
            User must ensure shape is fine
        * np.ndarray (int): top indices (in sorted order)
            User must ensure shape is fine
        * {'indices': np.ndarray, 'scores': np.ndarray}
    true_labels: csr_matrix or np.ndarray
        ground truth in sparse or dense format
    inv_psp: np.ndarray
        propensity scores for each label
    k: int, optional (default=5)
        compute propensity scored precision till k
    sorted: boolean, optional, default=False
        whether X is already sorted (will skip sorting)
        * used when X is of type dict or np.ndarray (of indices)
        * shape is not checked if X is an np.ndarray
        * must be set to True when X is an np.ndarray (of indices)
    use_cython: boolean, optional, default=False
        whether to use cython version to find top-k element
        * defaults to numba version
        * may be useful when numba version fails on a system

    Returns:
    -------
    np.ndarray: propensity scored precision values for 1-k
    """
    indices, true_labels, ps_indices, inv_psp = _setup_metric(
        X, true_labels, inv_psp, k=k, sorted=sorted, use_cython=use_cython)
    eval_flags = _eval_flags(indices, true_labels, inv_psp)
    ps_eval_flags = _eval_flags(ps_indices, true_labels, inv_psp)
    return _precision(eval_flags, k)/_precision(ps_eval_flags, k)
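
As the last line shows, this implementation reports a normalized variant: the propensity-scored precision of the predicted ranking divided by that of the best attainable ranking (the relevant labels sorted by decreasing 1 / p_l). Here is a self-contained sketch of that normalization, assuming dense inputs and illustrative names (not pyxclib's actual API):

import numpy as np

def normalized_psp_at_k(ranked_labels, relevant, inv_psp, k):
    # gain of each predicted label: 1 / p_l if relevant, else 0
    gains = np.array(
        [inv_psp[l] if l in relevant else 0.0 for l in ranked_labels[:k]]
    )
    # the ideal ranking puts the rarest (highest 1 / p_l) relevant labels first
    ideal = np.sort([inv_psp[l] for l in relevant])[::-1][:k]
    return gains.sum() / ideal.sum()

ranked = [3, 0, 2, 5, 1]                            # predicted label ids, best first
relevant = {0, 2, 4}                                # ground-truth label ids
inv_psp = np.array([1.2, 1.0, 4.0, 1.1, 6.0, 2.5])  # 1 / p_l indexed by label id
print(normalized_psp_at_k(ranked, relevant, inv_psp, k=5))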

A common characteristic of Extreme Multi-label Text Classification (XMTC) is the long-tail distribution of its huge label space. It is therefore recommended that XMTC methods also be evaluated with propensity-scored metrics such as PSP@k (propensity-scored precision at k) and PSnDCG@k (propensity-scored nDCG at k), as described in "Propensity-scored Performance at the Top".
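
For context, the inv_psp weights in the pyxclib snippet above are usually derived from the empirical propensity model of Jain et al. (2016), which maps label frequencies to propensities. A rough sketch (A and B are dataset-dependent constants; the 0.55 / 1.5 defaults below are only the commonly used values, an assumption on my part):

import numpy as np

def inverse_propensities(label_counts, n_samples, A=0.55, B=1.5):
    # Jain et al. (2016): p_l = 1 / (1 + C * (N_l + B)^(-A)),
    # with C = (log N - 1) * (B + 1)^A.
    # Rare (tail) labels get a small p_l, hence a large weight 1 / p_l.
    C = (np.log(n_samples) - 1.0) * (B + 1.0) ** A
    return 1.0 + C * (np.asarray(label_counts, dtype=float) + B) ** (-A)

counts = [5000, 120, 3, 1]  # head vs. tail label frequencies
print(inverse_propensities(counts, n_samples=10_000))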

ranx is for evaluating ranking tasks, not classification ones.
I prefer to keep it this way for the moment.
Therefore, I am not going to add the requested metric.

Even though it is framed as a classification task, it is often approached using information retrieval methods. Since there are millions of labels, the common approach is to retrieve a set of candidate labels and rank them, so XMTC is effectively a ranking task. Several papers employ MRR and nDCG as evaluation metrics.

However, the labels follow a long-tail distribution, so it is important to weigh them according to their frequencies. It would be great if you could reconsider and also publish the propensity-scored ranking metrics shown below.

[image: definitions of the propensity-scored ranking metrics]

Hello @AmenRa, what do you think? I think it is not too hard: basically, we have to pass the documents' propensities (as a simple list of weights) to the desired metric. I could try to integrate it if you help me.

I guess that, for PSP@k, I need to change hits += 1.0 to hits += 1.0 / pl, where pl is the user-provided propensity of the matched document.

def _hits(qrels, run, k, rel_lvl):
    qrels = clean_qrels(qrels, rel_lvl)
    if len(qrels) == 0:
        return 0.0

    k = fix_k(k, run)

    max_true_id = np.max(qrels[:, 0])
    min_true_id = np.min(qrels[:, 0])

    hits = 0.0

    for i in range(k):
        if run[i, 0] > max_true_id:
            continue
        if run[i, 0] < min_true_id:
            continue
        for j in range(qrels.shape[0]):
            if run[i, 0] == qrels[j, 0]:
                hits += 1.0
                break

    return hits
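
Spelled out, the variant I have in mind would look roughly like the sketch below (not ranx code; props is a hypothetical extra argument mapping each id to its propensity, and the helpers are the same ones _hits already uses):

def _ps_hits(qrels, run, k, rel_lvl, props):
    # same traversal as _hits, but each hit is weighted by 1 / p_l
    qrels = clean_qrels(qrels, rel_lvl)
    if len(qrels) == 0:
        return 0.0

    k = fix_k(k, run)

    max_true_id = np.max(qrels[:, 0])
    min_true_id = np.min(qrels[:, 0])

    hits = 0.0

    for i in range(k):
        if run[i, 0] > max_true_id or run[i, 0] < min_true_id:
            continue
        for j in range(qrels.shape[0]):
            if run[i, 0] == qrels[j, 0]:
                hits += 1.0 / props[run[i, 0]]  # propensity-weighted hit
                break

    return hits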

Is that right?

I suggest you first provide some simple test cases for the propensity-based metrics.
I have no time to read the related papers or debug code in the wild.

Hello @AmenRa, sorry for the late response. I've been working on the requested test cases, which took longer than expected.

Since this request was closed, I've provided the test cases and explanations in the new Feature Request.
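
For reference, one of those test cases might look roughly like the following (psp_at_k refers to the illustrative function sketched earlier in this thread, not to an existing ranx metric):

import numpy as np

def test_psp_at_k_single_query():
    # top-3 retrieved items: relevance flags and their propensities
    rel = [1, 0, 1]
    props = [0.8, 0.5, 0.2]
    # PSP@3 = (1/0.8 + 0 + 1/0.2) / 3 = (1.25 + 5.0) / 3
    expected = (1 / 0.8 + 1 / 0.2) / 3
    assert np.isclose(psp_at_k(rel, props, k=3), expected)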