AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Home Page: https://amenra.github.io/ranx

Incorrect result for f1 score

crux-taaniya opened this issue · comments

Calling f1 or _f1_parallel on all qrels and the run gives incorrect output, but if I call _f1 on each query individually, I get the correct F1 score.

The two calls below return 0 for 4 cases. Ideally, only 1 of the 18 cases in the _qrels and _run passed should be 0.

from ranx.metrics.f1 import _f1_parallel, _f1, f1

f1(_qrels, _run, 1, 1)
# output 
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])

_f1_parallel(_qrels, _run, 1, 1)
# output 
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])

But if I call _f1 for each item in qrels, I get the correct F1 score for every query.

from numba import prange  # outside a jitted function, prange behaves like plain range

scores = []

for i in prange(len(_qrels)):
    try:
        scores.append(_f1(_qrels[i], _run[i], 1, 1))
    except Exception as error:
        # handle the exception
        print(f"{i} An exception occurred:", error)
        scores.append(0)
        continue

# output
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0]

Notice that I get only 1 case where the F1 score is 0, which is what I expect for my data.

More context -

I am using the f1 metric at k=1 on a dataset where each query has exactly 1 relevant document. There are 18 unique queries in the qrels.
Since each query has only 1 relevant document, mrr@1 = recall@1 = 0.9444, and precision@1 = 0.9444 as well for my case.

Here's the output of my evaluation -
{'mrr@1': 0.9444444444444444, 'mrr@2': 0.9444444444444444, 'recall@1': 0.9444444444444444, 'recall@2': 0.9444444444444444, 'precision@1': 0.9444444444444444, 'f1@1': 0.7777777777777778}

f1@1 seems very low considering precision@1 and recall@1 are equal and high.
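
To make the discrepancy concrete: at k=1 with exactly one relevant document per query, precision@1, recall@1, and f1@1 are each either 1 or 0 for a given query and should all agree, so their means should coincide too. A quick arithmetic check using the per-query scores reported below:

# 17 of the 18 queries retrieve their relevant document at rank 1
expected_f1_at_1 = 17 / 18   # 0.9444..., matches precision@1 and recall@1
reported_f1_at_1 = 14 / 18   # 0.7777..., matches the f1@1 value returned by evaluate
print(expected_f1_at_1, reported_f1_at_1)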

Here's the pretty-printed output of the scores for each individual query.
I've manually validated the output for all the metrics, and all of them look correct to me except f1@1.
Notice that the hits score is 0 only for q_id '6', so an f1 score of 0 for q_id '6' is expected, but the f1 score is also 0 for q_ids '7', '8', and '9'.

{
    "mrr": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.3333333333333333,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "mrr@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "recall@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "precision@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "f1@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 0.0,
        "8": 0.0,
        "9": 0.0
    },
    "hits@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    }
}

Thanks in advance.
Let me know if I am missing something.

Hi, it is very strange that _f1 called on every query separately and _f1_parallel give different results.
_f1_parallel is just a parallelized loop that calls _f1 on each query.
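
For reference, a minimal sketch of what such a wrapper looks like (simplified, not ranx's actual source; the k and rel_lvl parameter names are assumptions based on the positional arguments used above):

import numpy as np
from numba import njit, prange
from ranx.metrics.f1 import _f1

@njit(cache=True, parallel=True)
def _f1_parallel_sketch(qrels, run, k, rel_lvl):
    # one score slot per query, filled by a numba parallel loop over _f1
    scores = np.zeros(len(qrels), dtype=np.float64)
    for i in prange(len(qrels)):
        scores[i] = _f1(qrels[i], run[i], k, rel_lvl)
    return scores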

Can you provide me with your qrels and run?

Hi,
thanks for the quick response.
I have attached the files here.

qrels_v2.xlsx
run.xlsx

I used the code below to create the _qrels and _run variables from the Excel files above.

import json

import pandas as pd
from ranx import Qrels, Run, evaluate

# qrels_path and preds_run_path point to the attached Excel files
qrels_df = pd.read_excel(qrels_path)
run_df = pd.read_excel(preds_run_path)

# drop rows where there is no doc_id for a query
run_df = run_df.dropna(subset=['doc_id'])

# cast doc_id to int
run_df['doc_id'] = run_df['doc_id'].astype('int')

# prefix doc_ids so they become string identifiers
qrels_df['doc_id'] = 'd_' + qrels_df['doc_id'].astype(str)
run_df['doc_id'] = 'd_' + run_df['doc_id'].astype(str)

# cast q_id columns to string type
qrels_df['q_id'] = qrels_df['q_id'].astype(str)
run_df['q_id'] = run_df['q_id'].astype(str)

# create qrels object
qrels = Qrels.from_df(df=qrels_df, q_id_col="q_id", doc_id_col="doc_id")

# create run object
run = Run.from_df(
    df=run_df,
    q_id_col="q_id",
    doc_id_col="doc_id",
    score_col="cosine_similarity",
)

# perform evaluation
score_dict = evaluate(qrels, run, ["mrr", "mrr@1", "mrr@2", "recall@1", "recall@2", "precision@1", "f1@1", "hits@1"])
print(score_dict)

# pretty-print per-query scores
print(json.dumps(run.scores, indent=4))

# convert qrels and run objects to numba typed lists to call f1, _f1, _f1_parallel directly
_qrels = qrels.to_typed_list()
_run = run.to_typed_list()

I cannot reproduce your results. F1@1 is correct on my side. 🤷🏻‍♂️

Screenshot 2023-07-27 at 20 42 45

By calling the internal functions instead of using evaluate, I again get the correct results.

Screenshot 2023-07-27 at 20 45 56

Do you have non-alphanumeric document IDs on your side?

Yes. The document IDs are originally numeric as entered in Excel.

I have also committed this pipeline here

Could you take a look and try it on your end when you get a chance?

I am running this pipeline on Colab with a CPU runtime, which is where I see this issue.

Once again, thanks for trying to replicate it on your end.

I can replicate it on Colab.
There is an unhandled zero division, as you noticed.
Unfortunately, numba does not raise it, and it probably causes an unreported crash of numba's threads or something similar.
The issue arises after the query with no retrieved relevant documents is evaluated.
I fixed the zero-division case on Colab and everything looks fine.
I will do some more testing locally and notify you when the bug is fixed.
Thanks for reporting the issue!
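
For illustration, F1 divides by the sum of precision and recall, which is zero for a query that retrieves no relevant documents; a guard along these lines (a sketch of the idea, not the actual patch) avoids the division:

from numba import njit

@njit(cache=True)
def _f1_guarded(precision_score, recall_score):
    # with no retrieved relevant documents both terms are 0.0, and
    # numba-compiled code may not surface the resulting
    # ZeroDivisionError, so guard it explicitly
    denominator = precision_score + recall_score
    if denominator == 0.0:
        return 0.0
    return 2.0 * precision_score * recall_score / denominator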

@AmenRa @crux-taaniya how about opening an issue in numba and tracking it here? This seems serious enough that they would probably want to fix it.

@diegoceccarelli I was going to do it this morning, but then I saw that there is already an open issue about it.
numba/numba#6976


Thanks @AmenRa!

Fixed in v0.3.16.