embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

Adding Public Health QA dataset

xhluca opened this issue

I plan to contribute a public health dataset built from COVID-19-related question-and-answer pairs published by major public health authorities. Before opening the PR, I have a few questions:

  • Does the dataset have to be sourced from a published paper, or is a GitHub repository fine?
  • Given that it has ~800 pairs across 8 languages (50-200 per language), would it be better categorized under reranking or retrieval?

Thanks!

Hello,

We'd love to have medical datasets as we're handling multiple domains!

For your questions:

  • It would be better if we have a source for the dataset so we can assess its quality. That said, you can still open a PR, and I guess we'll determine this from the metadata.
  • At the moment, we treat most QA datasets as retrieval datasets because the integration is easier. So it's better to start with retrieval; reranking would require some preprocessing to build negative pairs, but it's still doable.
  • For the source, I created the dataset myself from content retrieved from the CDC, WHO, etc. I can link to the repository I created a few years ago. I will start the PR and we can decide from there.
  • Makes sense! I will start with retrieval; if we later feel reranking is a better fit, I can change it.
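
For anyone following the retrieval vs. reranking point above, here is a rough sketch of why reranking needs the extra preprocessing step. It is not MTEB's actual loader code: the function names, the dictionary layouts, and the example pairs are all made up for illustration, and random sampling of negatives is just one simple assumption (harder negatives could be mined instead).

```python
import random

# Hypothetical QA pairs for illustration only; not taken from the real dataset.
pairs = [
    ("How does the virus spread?", "It spreads mainly through respiratory droplets."),
    ("Should I wear a mask?", "Masks are recommended in crowded indoor settings."),
    ("What are common symptoms?", "Common symptoms include fever, cough, and fatigue."),
]


def to_retrieval(qa_pairs):
    """Retrieval view: every answer is a document, each question has one relevant document."""
    queries, corpus, qrels = {}, {}, {}
    for i, (question, answer) in enumerate(qa_pairs):
        qid, did = f"q{i}", f"d{i}"
        queries[qid] = question
        corpus[did] = answer
        qrels[qid] = {did: 1}  # relevance label 1 for the paired answer
    return queries, corpus, qrels


def to_reranking(qa_pairs, num_negatives=2, seed=0):
    """Reranking view: each question needs explicit negatives, sampled here at random."""
    rng = random.Random(seed)
    answers = [answer for _, answer in qa_pairs]
    samples = []
    for question, positive in qa_pairs:
        candidates = [a for a in answers if a != positive]
        negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
        samples.append({"query": question, "positive": [positive], "negative": negatives})
    return samples


queries, corpus, qrels = to_retrieval(pairs)
rerank_samples = to_reranking(pairs)
```

In the retrieval view the other answers act as implicit negatives because the whole corpus is searched, which is why no extra preprocessing is needed there.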

I've added the dataset in #750