Incorrect result for f1 score
crux-taaniya opened this issue · comments
Calling f1 or _f1_parallel on all qrels and the run gives incorrect output, but calling _f1 on each individual query gives the correct F1 score.
The two calls below return 0 for 4 queries. Ideally, only 1 of the 18 queries in _qrels and _run should score 0.
from ranx.metrics.f1 import _f1_parallel, _f1, f1
f1(_qrels, _run, 1, 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
_f1_parallel(_qrels, _run, 1, 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
But if I call _f1 on each item in qrels, I get the correct F1 score for all queries.
scores = []
for i in range(len(_qrels)):  # plain range; prange only has an effect inside a numba-jitted function
    try:
        scores.append(_f1(_qrels[i], _run[i], 1, 1))
    except Exception as error:
        # handle the exception
        print(f"{i} An exception occurred:", error)
        scores.append(0)
        continue
# output
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0]
Notice that I get only one query where the F1 score is 0, which I believe is expected for my case.
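The trailing zeros in the parallel output look like a loop whose workers stop writing results after the failing query. A minimal pure-Python emulation of that failure mode (illustrative only, not ranx's actual implementation):

```python
def f1_like(hits, k, n_relevant):
    """Toy F1@k: divides by (precision + recall), so 0 hits -> 0/0."""
    precision = hits / k
    recall = hits / n_relevant
    return 2 * precision * recall / (precision + recall)  # ZeroDivisionError when hits == 0

# 18 queries, each with 1 relevant document; the query at index 14 retrieves nothing relevant
hits_per_query = [1] * 14 + [0] + [1] * 3

scores = [0.0] * len(hits_per_query)  # results buffer pre-filled with zeros
for i, hits in enumerate(hits_per_query):
    try:
        scores[i] = f1_like(hits, k=1, n_relevant=1)
    except ZeroDivisionError:
        break  # emulate a worker dying silently: later slots keep their 0.0

print(scores)  # [1.0] * 14 + [0.0] * 4 — the same pattern as the parallel output above
```

If the loop instead handled the zero division per query, only index 14 would be 0.0.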
More context -
I am using the F1 metric at k = 1 on a dataset where each query has exactly 1 relevant document. There are 18 unique queries in the qrels.
Since each query has only 1 relevant document, mrr@1 = recall@1 = precision@1 = 0.9444 for my case.
Here's the output of my evaluation -
{'mrr@1': 0.9444444444444444, 'mrr@2': 0.9444444444444444, 'recall@1': 0.9444444444444444, 'recall@2': 0.9444444444444444, 'precision@1': 0.9444444444444444, 'f1@1': 0.7777777777777778}
f1@1 seems very low considering that precision and recall at 1 are equal and high.
Here's the pretty output of scores for each individual query -
I've manually validated the output for all the metrics, and all of them look correct to me except f1@1.
Notice that the hits score is 0 only for q_id '6', so an F1 score of 0 for q_id '6' is expected, but the F1 score is also 0 for q_ids '7', '8', and '9'.
{
    "mrr": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.3333333333333333,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "mrr@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "recall@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "precision@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "f1@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 0.0,
        "8": 0.0,
        "9": 0.0
    },
    "hits@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    }
}
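Since every query has exactly one relevant document and k = 1, per-query precision@1 = recall@1 = hits@1 ∈ {0, 1}, so per-query F1@1 should equal hits@1 too. A quick sanity check of what the averages should be, given the per-query scores above:

```python
# per-query hits@1 from the breakdown above: one miss (q_id '6') out of 18 queries
hits_at_1 = [1.0] * 17 + [0.0]

expected_f1_at_1 = sum(hits_at_1) / len(hits_at_1)
print(expected_f1_at_1)  # 17/18 ≈ 0.9444, matching mrr@1, recall@1, and precision@1

# the reported f1@1 instead corresponds to 4 zeros out of 18
reported_f1_at_1 = [1.0] * 14 + [0.0] * 4
print(sum(reported_f1_at_1) / len(reported_f1_at_1))  # 14/18 ≈ 0.7778
```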
Thanks in advance.
Let me know if I am missing something.
Hi, it is very strange that _f1 called on every query separately and _f1_parallel give different results. _f1_parallel is just a parallelized loop over _f1 and the queries.
Can you provide me your qrels and run?
Hi, thanks for the quick response.
I have attached the files here. I used the code below to create the _qrels and _run variables from the above Excel files:
import json

import pandas as pd
from ranx import Qrels, Run, evaluate

qrels_df = pd.read_excel(qrels_path)
run_df = pd.read_excel(preds_run_path)

# drop rows where there is no doc_id for a query
run_df = run_df.dropna(subset=['doc_id'])

# cast doc_id to int
run_df['doc_id'] = run_df['doc_id'].astype('int')

qrels_df['doc_id'] = 'd_' + qrels_df['doc_id'].astype(str)
run_df['doc_id'] = 'd_' + run_df['doc_id'].astype(str)

# cast q_id columns to string type
qrels_df['q_id'] = qrels_df['q_id'].astype(str)
run_df['q_id'] = run_df['q_id'].astype(str)

# Create qrels object
qrels = Qrels.from_df(df=qrels_df, q_id_col="q_id", doc_id_col="doc_id")

# Create run object
run = Run.from_df(df=run_df,
                  q_id_col="q_id",
                  doc_id_col="doc_id",
                  score_col="cosine_similarity")

# perform evaluation
score_dict = evaluate(qrels, run, ["mrr", "mrr@1", "mrr@2", "recall@1", "recall@2", "precision@1", "f1@1", "hits@1"])
print(score_dict)

# pretty-print per-query scores
print(json.dumps(run.scores, indent=4))

# convert qrels and run objects to numba typed lists to call f1, _f1, _f1_parallel
_qrels = qrels.to_typed_list()
_run = run.to_typed_list()
Do you have non-alphanumeric document IDs on your side?
Yes. The document IDs are originally numeric when entered in Excel.
I have also committed this pipeline here.
Could you take a look and try it on your end when you get a chance?
I am using Colab with a CPU runtime to run this pipeline, and that is where I see the issue.
Once again, thanks for trying to replicate it on your end.
I can replicate it on Colab.
There is an unhandled zero division, as you noticed.
Unfortunately, numba is not raising it, and it probably causes an unreported crash of numba's threads or something similar.
The issue arises after the query with no retrieved relevant documents is evaluated.
I fixed the zero-division case on Colab and everything looks fine.
I will do some more testing locally and notify you when the bug is fixed.
Thanks for reporting the issue!
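For anyone hitting this before the fix lands, the problem reduces to guarding the F1 denominator when both precision and recall are zero. A minimal sketch of the guarded computation (f1_at_k is a hypothetical helper, not ranx's actual internals):

```python
def f1_at_k(relevant_ids, retrieved_ids, k):
    """F1@k with an explicit zero-division guard (illustrative, not ranx code)."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    if precision + recall == 0.0:
        return 0.0  # guard: without this, 0/0 would be attempted
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(["d_1"], ["d_1", "d_2"], 1))  # 1.0 — relevant doc ranked first
print(f1_at_k(["d_1"], ["d_3", "d_1"], 1))  # 0.0 — no relevant doc in the top 1
```

With the guard in place, a query with zero hits simply scores 0.0 instead of triggering a division error inside the parallel loop.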
@AmenRa @crux-taaniya how about opening an issue in numba and tracking it here? This seems serious enough that they would probably want to fix it.
@diegoceccarelli I was going to do it this morning, but then I saw that there is already an open issue about it: numba/numba#6976
Thanks @AmenRa !
Fixed in v0.3.16.