KPQA

This repository provides an evaluation metric for generative question answering systems, based on our NAACL 2021 paper KPQA: A Metric for Generative Question Answering Using Keyphrase Weights.
Here, we provide the code to train KPQA, a pretrained model, human-annotated data, and the code to compute the KPQA metric.

The repository will be updated by 6/10 into a more convenient form with a Jupyter notebook demo (the weights will also be uploaded to the Hugging Face model hub).

Dataset

We provide human judgments of correctness for 4 datasets: MS-MARCO NLG, AVSD, NarrativeQA, and SemEval 2018 Task 11 (SemEval).
For MS-MARCO NLG and AVSD, we generate answers using two models for each dataset. For NarrativeQA and SemEval, we preprocessed the data from [Evaluating Question Answering Evaluation](https://www.aclweb.org/anthology/D19-5817).

Usage

1. Install Prerequisites

Install the packages listed in "requirements.txt", as sketched below.
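A minimal sketch of the installation, assuming a standard pip workflow (the virtual environment step is optional and not part of the original instructions):

```bash
# optional: isolate dependencies in a virtual environment
python -m venv kpqa-env && source kpqa-env/bin/activate

# install the packages listed in requirements.txt
pip install -r requirements.txt
```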

2. Download Pretrained Model

We provide the pre-trained KPQA model at the following link:
https://drive.google.com/file/d/1pHQuPhf-LBFTBRabjIeTpKy3KGlMtyzT/view?usp=sharing
Download "ckpt.zip" and extract it.
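One way to fetch and extract the checkpoint from the command line, assuming the gdown package is available (the file ID below is taken from the Drive link above; downloading manually in the browser works just as well):

```bash
# download ckpt.zip from Google Drive (file ID taken from the link above)
pip install gdown
gdown 1pHQuPhf-LBFTBRabjIeTpKy3KGlMtyzT -O ckpt.zip

# extract the checkpoint; adjust if the archive unpacks into a nested directory
unzip ckpt.zip -d ckpt
export CHECKPOINT_DIR=./ckpt
```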

3. Compute Metric

You can compute the KPQA metric with "compute_correlation.py":

```bash
# --dataset:   target dataset to evaluate the metric on (e.g., marco)
# --qa_model:  model used to generate the answers (e.g., unilm)
# --model_dir: path to the checkpoint directory (the extracted "ckpt.zip")
python compute_correlation.py \
  --dataset marco \
  --qa_model unilm \
  --model_dir $CHECKPOINT_DIR
```

When evaluating the various metrics on the MS-MARCO NLG dataset, the printed results (correlations with human judgments) are as follows.

Metrics | Pearson | Spearman
--- | --- | ---
BLEU-1 | 0.369 | 0.337
BLEU-4 | 0.173 | 0.224
ROUGE-L | 0.317 | 0.289
CIDEr | 0.261 | 0.256
BERTScore | 0.469 | 0.445
BLEU-1-KPQA | 0.729 | 0.676
ROUGE-L-KPQA | 0.735 | 0.674
BERTScore-KPQA | 0.698 | 0.66

Train KPQA (optional)

You can train your own KPQA model on the provided dataset, or on your own dataset, using "train.py".
To train with the default settings, run "train_kpqa.sh"; see the sketch below.
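A minimal sketch of both options, assuming the commands are run from the repository root; the exact arguments of "train.py" are not listed here, and the --help call assumes a standard argparse-style interface, so check the script itself for the available options:

```bash
# train with the default settings
bash train_kpqa.sh

# or call the training script directly and inspect its options first
python train.py --help
```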

Reference

If you find this repo useful, please consider citing:

@inproceedings{lee2021kpqa,
  title={KPQA: A Metric for Generative Question Answering Using Keyphrase Weights},
  author={Lee, Hwanhee and Yoon, Seunghyun and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Shin, Joongbo and Jung, Kyomin},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={2105--2115},
  year={2021}
}
