embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

EstQA contains multiple ids for the same context document

x-tabdeveloping opened this issue

I noticed that the way I uploaded EstQA to HuggingFace and then used it in the benchmark is not suitable for a retrieval task: multiple questions can share the same context, so the same context document ends up with multiple IDs, and this is not accounted for.

Potential fixes:

  1. Just use the answer as the retrieved passage instead of the context.
  2. Fix the dataset on HuggingFace with multiple tables.
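
A minimal sketch of what the second fix could look like, assuming a SQuAD-style table with hypothetical column names `id`, `question`, and `context` (the real EstQA schema may differ). Each distinct context gets a single corpus ID, and a separate relevance table links every question to it. The helper `split_into_tables` and the sample rows are made up for illustration:

```python
from collections import defaultdict


def split_into_tables(rows):
    """Split SQuAD-style rows into corpus, queries, and qrels tables.

    Assumes each row is a dict with the hypothetical keys
    'id', 'question', and 'context'.
    """
    context_to_id = {}          # context text -> its single corpus ID
    corpus, queries = {}, {}
    qrels = defaultdict(dict)   # query ID -> {corpus ID: relevance}
    for row in rows:
        text = row["context"]
        if text not in context_to_id:
            # First time this context is seen: assign one stable ID.
            doc_id = f"doc{len(context_to_id)}"
            context_to_id[text] = doc_id
            corpus[doc_id] = text
        doc_id = context_to_id[text]
        queries[row["id"]] = row["question"]
        qrels[row["id"]][doc_id] = 1
    return corpus, queries, dict(qrels)


# Made-up rows: q1 and q2 share the same context.
rows = [
    {"id": "q1", "question": "What is the capital of Estonia?",
     "context": "Tallinn is the capital of Estonia."},
    {"id": "q2", "question": "Which country is Tallinn in?",
     "context": "Tallinn is the capital of Estonia."},
    {"id": "q3", "question": "Where is the University of Tartu?",
     "context": "The University of Tartu is in Tartu."},
]
corpus, queries, qrels = split_into_tables(rows)
```

With this layout the shared context is stored once, and both `q1` and `q2` point at the same corpus ID.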

Since you can have multiple positive retrievals, shouldn't you simply have multiple positive pairs?

The problem is that, since the same context gets multiple IDs, even when the model retrieves the correct context for a question, the result might be counted as a false positive. But I'm on it, submitting a PR soon.
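
To make the failure mode concrete, here is a toy sketch with made-up data. The helper `hit_at_1` is hypothetical and is not MTEB's evaluation code; it only checks whether the top-ranked ID is marked relevant for each query:

```python
def hit_at_1(retrieved, qrels):
    """Fraction of queries whose top-ranked document ID is marked relevant."""
    hits = sum(1 for qid, ranked in retrieved.items() if ranked[0] in qrels[qid])
    return hits / len(retrieved)


# The same passage is stored twice, once per question row.
corpus = {
    "c1": "Tallinn is the capital of Estonia.",
    "c2": "Tallinn is the capital of Estonia.",
}
qrels = {"q1": {"c1"}, "q2": {"c2"}}

# The model finds the right passage for both questions, but identical
# texts score identically, so assume the tie resolves to c1 each time.
retrieved = {"q1": ["c1", "c2"], "q2": ["c1", "c2"]}
score_duplicated = hit_at_1(retrieved, qrels)  # q2 is counted as a miss

# With one ID per unique context, both questions are judged against c1.
dedup_qrels = {"q1": {"c1"}, "q2": {"c1"}}
dedup_retrieved = {"q1": ["c1"], "q2": ["c1"]}
score_dedup = hit_at_1(dedup_retrieved, dedup_qrels)
```

Here `q2` is penalized in the duplicated setup even though the retrieved text is exactly the passage that answers it.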