Adding Public Health QA dataset

Question

xhluca opened this issue a month ago · comments

I plan to contribute a public health dataset built from COVID-19-related question-and-answer pairs published by major public health authorities. Before opening the PR, I have a few questions:

Does the dataset have to be sourced from a published paper, or is a GitHub repository fine?
Given that it has ~800 pairs across 8 languages (50-200 per language), would it be better categorized under reranking or retrieval?

Thanks!

Imene Kerboua · Answer 1 · Fri May 17 2024 03:44:34 GMT+0800 (China Standard Time)

Hello,

We'd love to have medical datasets as we're handling multiple domains!

For your questions:

It would be better to have a source for the dataset so we can assess its quality, although you can still open a PR and I guess we'll determine this from the metadata.
At the moment we have used most QA datasets as retrieval datasets, since the integration is easier. So it's better to start with retrieval; reranking would require some preprocessing to build negative pairs, but it is still doable.
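
To make the difference concrete, here is a minimal sketch of both framings in plain Python. It is not the benchmark's actual loader: the function names, the `q0`/`d0` ID scheme, the dictionary layout, and the placeholder pairs are all assumptions made for illustration. The retrieval framing only needs each question mapped to its own answer, while the reranking framing also needs negatives, which are sampled here at random from the other answers.

```python
import random


def to_retrieval(pairs):
    """Split (question, answer) pairs into queries, a corpus, and relevance labels."""
    queries = {f"q{i}": question for i, (question, _) in enumerate(pairs)}
    corpus = {f"d{i}": answer for i, (_, answer) in enumerate(pairs)}
    # Each question has exactly one relevant document: its own answer.
    qrels = {f"q{i}": {f"d{i}": 1} for i in range(len(pairs))}
    return queries, corpus, qrels


def to_reranking(pairs, num_negatives=2, seed=0):
    """Keep each answer as the positive and sample other answers as negatives."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    samples = []
    for i, (question, answer) in enumerate(pairs):
        # Candidate negatives: every other answer that differs from the positive.
        candidates = [a for j, (_, a) in enumerate(pairs) if j != i and a != answer]
        negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
        samples.append({"query": question, "positive": [answer], "negative": negatives})
    return samples


# Placeholder pairs, not taken from the dataset discussed in this thread.
pairs = [
    ("Question A?", "Answer A."),
    ("Question B?", "Answer B."),
    ("Question C?", "Answer C."),
]

queries, corpus, qrels = to_retrieval(pairs)
reranking = to_reranking(pairs)
```

For a multilingual dataset, negatives would presumably be sampled within each language, so this helper would be called once per language subset rather than on the pooled pairs.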

Xing Han Lu · Answer 2 · Fri May 17 2024 03:50:42 GMT+0800 (China Standard Time)

For the source, I created the dataset myself from content published by the CDC, WHO, etc. I can link to the repository I created a few years ago. I will start the PR and we can decide from there.
Makes sense! I will start with retrieval; if we later feel reranking is a better fit, I can change it.

Xing Han Lu · Answer 3 · Tue May 21 2024 01:56:38 GMT+0800 (China Standard Time)

I've added the dataset in #750.