Implementation of "ACL'24: When Do LLMs Need Retrieval Augmentation? Mitigating LLMs’ Overconfidence Helps Retrieval Augmentation"


When to Retrieve

Basic Usage

There are four steps to obtain the desired responses: inference, collecting mis-formatted samples, re-generating them, and merging/evaluation.

Inference

  • Step 1: Run run_llm.py to get the basic results

    python run_llm.py --source data/nq_sample.jsonl --ra none --type prior --outfile ./examples/test.jsonl --model chatgpt
    • You can specify --type [qa/qa_explain/qa_cot/qa_gene/prior/prior_punish/prior_explain/prior_pun_exp/prior_cot/prior_gene]
  • Note

    • The desired response format is [Answer, Confidence]. For example, for the question "What is the capital of China?", we expect the response "Beijing, Certain".
    • LLMs may produce responses that do not adhere to this format, e.g., returning only the answer Beijing or only the confidence Uncertain.
    • We perform post-processing to recover the desired output format (a minimal parsing sketch follows this list).
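
For intuition, here is a minimal sketch of this kind of post-processing, assuming the confidence is expressed with words such as Certain/Uncertain; it is an illustration, not the exact logic in collect.py:

# Illustrative only: split a raw response into (answer, confidence).
# Assumption: confidence is one of the words below.
CONFIDENCE_WORDS = {"certain", "uncertain"}

def split_response(raw: str):
    """Return (answer, confidence); a field is None when it is missing."""
    parts = [p.strip() for p in raw.strip().rstrip(".").split(",")]
    if len(parts) >= 2 and parts[-1].lower() in CONFIDENCE_WORDS:
        return ", ".join(parts[:-1]), parts[-1]   # e.g. "Beijing, Certain"
    if len(parts) == 1 and parts[0].lower() in CONFIDENCE_WORDS:
        return None, parts[0]                     # only a confidence was given
    return raw.strip(), None                      # only an answer was given

print(split_response("Beijing, Certain"))   # ('Beijing', 'Certain')
print(split_response("Beijing"))            # ('Beijing', None)
print(split_response("Uncertain"))          # (None, 'Uncertain')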

Post-process & Evaluate

  • Step 2: Get the indices of samples that do not match the expected output format.

    python collect.py --mode preprocess --source ./data/nq_sample.jsonl --input ./examples/test.jsonl --output ./examples/test_new.jsonl --confidence ./examples/confidence.jsonl --answer ./examples/answer.jsonl --model chatgpt
  • Step 3: Generate the missing results for these samples.

    python run_llm.py --source data/nq_sample.jsonl --ra none --type qa --outfile ./examples/post_answer.jsonl --idx ./examples/answer.jsonl --model chatgpt
    
    python run_llm.py --source ./examples/test.jsonl --ra none --type post --outfile ./examples/post_confidence.jsonl --idx ./examples/confidence.jsonl --model chatgpt
    • If a punish-style strategy (i.e., prior_punish or prior_pun_exp) was used during inference, set --type to post_punish.
  • Step 4: Merge the results and evaluate (a conceptual sketch follows below)

    python collect.py --mode evaluate --source ./data/nq_sample.jsonl --input ./examples/test.jsonl --output ./examples/test_new.jsonl --confidence ./examples/post_confidence.jsonl --answer ./examples/post_answer.jsonl
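
Conceptually, this step writes the re-generated answers/confidences back at the recorded indices and scores a prediction as correct if it contains a gold answer. The sketch below uses hypothetical field names (idx, response, answers) for illustration; the real schema is defined by collect.py:

import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def merge_and_evaluate(pred_path, patch_path, source_path):
    # Field names below are assumptions, not the repository's actual schema.
    preds = load_jsonl(pred_path)                # Step-1 responses
    for item in load_jsonl(patch_path):          # re-generated samples (Step 3)
        preds[item["idx"]]["response"] = item["response"]   # write back by index
    gold = load_jsonl(source_path)               # questions with gold answers
    correct = sum(
        any(ans.lower() in p["response"].lower() for ans in g["answers"])
        for p, g in zip(preds, gold)
    )
    return correct / len(preds)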

RAG

Static RAG

python run_llm.py --source data/nq_sample.jsonl --ra [sparse/dense/gold] --type qa --outfile ./examples/test_gold_static.jsonl --model chatgpt
  • For NQ, you can also specify --ra dpr to use the DPR documents as the gold documents (we do this in our paper).
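
For intuition, static RAG augments every question with retrieved passages, regardless of the model's confidence. A rough sketch of such a prompt follows, with an assumed template that does not necessarily match the one in run_llm.py:

# Illustrative only: build a retrieval-augmented QA prompt from top-k passages.
def build_rag_prompt(question: str, passages: list, k: int = 3) -> str:
    """Prepend the top-k retrieved passages to the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages[:k]))
    return (
        "Answer the question based on the given passages.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt("What is the capital of China?",
                       ["Beijing is the capital of China."]))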

Adaptive RAG

Adaptive RAG also takes four steps, closely mirroring the Inference pipeline above; the retrieval decision itself is sketched after the list below.

  • Step 1: Run run_llm.py to get the basic results

    • Specify --ra [sparse/dense/gold] (--ra dpr is also fine for NQ)
  • Step 2: Get the indices of samples that do not match the expected output format.

    • Same as Step 2 of Inference.
  • Step 3: Generate the missing results for these samples.

    • Same as Step 3 of Inference, but additionally specify --ra.
  • Step 4: To be continued
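
The idea behind adaptive RAG in the paper is to retrieve only for questions on which the model expresses uncertainty in Step 1, and to keep the closed-book answer otherwise. A minimal sketch of that decision, using hypothetical helper functions rather than the repository's actual API:

# Sketch only: trust confident closed-book answers, fall back to RAG otherwise.
# `answer_without_retrieval` and `answer_with_retrieval` are hypothetical helpers.
def adaptive_answer(question, answer_without_retrieval, answer_with_retrieval):
    answer, confidence = answer_without_retrieval(question)    # Step-1 result
    if confidence is not None and confidence.lower() == "certain":
        return answer                                           # keep the closed-book answer
    return answer_with_retrieval(question)                      # retrieve and re-answer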

Note

You can find the necessary commands in scripts/ and demo data in examples/.

The repository is continuously being updated.

Feel free to open an issue.
