Incorrect result for f1 score
crux-taaniya opened this issue · comments
Calling f1 or _f1_parallel on all qrels and the run gives incorrect output, but calling _f1 on each individual query gives the correct F1 score.
The two calls below return 0 for 4 queries. Ideally, only 1 of the 18 queries in _qrels and _run should score 0.
from ranx.metrics.f1 import _f1_parallel, _f1, f1
f1(_qrels, _run, 1, 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
_f1_parallel(_qrels, _run, 1, 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
But if I call _f1 on each item in qrels, I get the correct F1 score for all queries.
scores = []
for i in range(len(_qrels)):  # plain range; prange only has an effect inside a numba-jitted function
    try:
        scores.append(_f1(_qrels[i], _run[i], 1, 1))
    except Exception as error:
        # handle the exception
        print(f"{i} An exception occurred:", error)
        scores.append(0)
        continue
# output
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0]
Notice that I get only one query where the F1 score is 0, which I believe is expected for my case.
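The trailing zeros in the parallel output look like a loop whose workers stop writing results after the failing query. A minimal pure-Python emulation of that failure mode (illustrative only, not ranx's actual implementation):

```python
def f1_like(hits, k, n_relevant):
    """Toy F1@k: divides by (precision + recall), so 0 hits -> 0/0."""
    precision = hits / k
    recall = hits / n_relevant
    return 2 * precision * recall / (precision + recall)  # ZeroDivisionError when hits == 0

# 18 queries, each with 1 relevant document; the query at index 14 retrieves nothing relevant
hits_per_query = [1] * 14 + [0] + [1] * 3

scores = [0.0] * len(hits_per_query)  # results buffer pre-filled with zeros
for i, hits in enumerate(hits_per_query):
    try:
        scores[i] = f1_like(hits, k=1, n_relevant=1)
    except ZeroDivisionError:
        break  # emulate a worker dying silently: later slots keep their 0.0

print(scores)  # [1.0] * 14 + [0.0] * 4 — the same pattern as the parallel output above
```

If the loop instead handled the zero division per query, only index 14 would be 0.0.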
More context -
I am using the F1 metric at k = 1 on a dataset where each query has exactly 1 relevant document. There are 18 unique queries in the qrels.
Since each query has only 1 relevant document, mrr@1 = recall@1 = precision@1 = 0.9444 for my case.
Here's the output of my evaluation -
{'mrr@1': 0.9444444444444444, 'mrr@2': 0.9444444444444444, 'recall@1': 0.9444444444444444, 'recall@2': 0.9444444444444444, 'precision@1': 0.9444444444444444, 'f1@1': 0.7777777777777778}
f1@1 seems very low considering that precision and recall at 1 are equal and high.
Here's the pretty output of scores for each individual query -
I've manually validated the output for all the metrics, and all of them look correct to me except f1@1.
Notice that the hits score is 0 only for q_id '6', so an F1 score of 0 for q_id '6' is expected, but the F1 score is also 0 for q_ids '7', '8', and '9'.
{
    "mrr": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.3333333333333333,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "mrr@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "recall@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "precision@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    },
    "f1@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 0.0,
        "8": 0.0,
        "9": 0.0
    },
    "hits@1": {
        "0": 1.0,
        "1": 1.0,
        "10": 1.0,
        "11": 1.0,
        "12": 1.0,
        "13": 1.0,
        "14": 1.0,
        "15": 1.0,
        "16": 1.0,
        "17": 1.0,
        "2": 1.0,
        "3": 1.0,
        "4": 1.0,
        "5": 1.0,
        "6": 0.0,
        "7": 1.0,
        "8": 1.0,
        "9": 1.0
    }
}
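Since every query has exactly one relevant document and k = 1, per-query precision@1 = recall@1 = hits@1 ∈ {0, 1}, so per-query F1@1 should equal hits@1 too. A quick sanity check of what the averages should be, given the per-query scores above:

```python
# per-query hits@1 from the breakdown above: one miss (q_id '6') out of 18 queries
hits_at_1 = [1.0] * 17 + [0.0]

expected_f1_at_1 = sum(hits_at_1) / len(hits_at_1)
print(expected_f1_at_1)  # 17/18 ≈ 0.9444, matching mrr@1, recall@1, and precision@1

# the reported f1@1 instead corresponds to 4 zeros out of 18
reported_f1_at_1 = [1.0] * 14 + [0.0] * 4
print(sum(reported_f1_at_1) / len(reported_f1_at_1))  # 14/18 ≈ 0.7778
```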
Thanks in advance.
Let me know if I am missing something.
Hi, it is very strange that _f1 called on every query separately and _f1_parallel give different results. _f1_parallel is just a parallelized loop over _f1 and the queries.
Can you provide me your qrels and run?
Hi, thanks for the quick response.
I have attached the files here. I used the code below to create the _qrels and _run variables from the above Excel files:
import json

import pandas as pd
from ranx import Qrels, Run, evaluate

qrels_df = pd.read_excel(qrels_path)
run_df = pd.read_excel(preds_run_path)

# drop rows where there is no doc_id for a query
run_df = run_df.dropna(subset=['doc_id'])

# cast doc_id to int
run_df['doc_id'] = run_df['doc_id'].astype('int')

qrels_df['doc_id'] = 'd_' + qrels_df['doc_id'].astype(str)
run_df['doc_id'] = 'd_' + run_df['doc_id'].astype(str)

# cast q_id columns to string type
qrels_df['q_id'] = qrels_df['q_id'].astype(str)
run_df['q_id'] = run_df['q_id'].astype(str)

# Create qrels object
qrels = Qrels.from_df(df=qrels_df, q_id_col="q_id", doc_id_col="doc_id")

# Create run object
run = Run.from_df(df=run_df,
                  q_id_col="q_id",
                  doc_id_col="doc_id",
                  score_col="cosine_similarity")

# perform evaluation
score_dict = evaluate(qrels, run, ["mrr", "mrr@1", "mrr@2", "recall@1", "recall@2", "precision@1", "f1@1", "hits@1"])
print(score_dict)

# pretty-print per-query scores
print(json.dumps(run.scores, indent=4))

# convert qrels and run objects to numba typed lists to call f1, _f1, _f1_parallel
_qrels = qrels.to_typed_list()
_run = run.to_typed_list()
Do you have non-alphanumeric document IDs on your side?
Yes. The document IDs are originally numeric when entered in Excel.
I have also committed this pipeline here.
Could you take a look and try it on your end when you get a chance?
I am using Colab with a CPU runtime to run this pipeline, and that is where I see the issue.
Once again, thanks for trying to replicate it on your end.
I can replicate it on Colab.
There is an unhandled zero division, as you noticed.
Unfortunately, numba is not raising it, and it probably causes an unreported crash of numba's threads or something similar.
The issue arises after the query with no retrieved relevant documents is evaluated.
I fixed the zero-division case on Colab and everything looks fine.
I will do some more testing locally and notify you when the bug is fixed.
Thanks for reporting the issue!
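For anyone hitting this before the fix lands, the problem reduces to guarding the F1 denominator when both precision and recall are zero. A minimal sketch of the guarded computation (f1_at_k is a hypothetical helper, not ranx's actual internals):

```python
def f1_at_k(relevant_ids, retrieved_ids, k):
    """F1@k with an explicit zero-division guard (illustrative, not ranx code)."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    if precision + recall == 0.0:
        return 0.0  # guard: without this, 0/0 would be attempted
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(["d_1"], ["d_1", "d_2"], 1))  # 1.0 — relevant doc ranked first
print(f1_at_k(["d_1"], ["d_3", "d_1"], 1))  # 0.0 — no relevant doc in the top 1
```

With the guard in place, a query with zero hits simply scores 0.0 instead of triggering a division error inside the parallel loop.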
@AmenRa @crux-taaniya how about opening an issue in numba and tracking it here? This seems serious enough that they would probably want to fix it.
@diegoceccarelli I was going to do it this morning, but then I saw that there is already an open issue about it: numba/numba#6976
Thanks @AmenRa !
Fixed in v0.3.16.