Implementation of "ACL'24: When Do LLMs Need Retrieval Augmentation? Mitigating LLMs’ Overconfidence Helps Retrieval Augmentation"


When to Retrieve

Basic Usage

There are four steps to obtain the desired responses: inference, collecting mis-formatted samples, re-generating them, and merging/evaluation.

Inference

  • Step 1: Run run_llm.py to get the basic results

    python run_llm.py --source data/nq_sample.jsonl --ra none --type prior --outfile ./examples/test.jsonl --model chatgpt
    • You can specify --type [qa/qa_explain/qa_cot/qa_gene/prior/prior_punish/prior_explain/prior_pun_exp/prior_cot/prior_gene]
  • Note

    • The desired response format is [Answer, Confidence]. For example, for the question "What is the capital of China?", we expect the response "Beijing, Certain".
    • LLMs may produce responses that do not adhere to this format, e.g., returning only the answer Beijing or only the confidence Uncertain.
    • We perform post-processing to recover the desired output format (a minimal parsing sketch follows this list).
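
For intuition, here is a minimal sketch of this kind of post-processing, assuming the confidence is expressed with words such as Certain/Uncertain; it is an illustration, not the exact logic in collect.py:

# Illustrative only: split a raw response into (answer, confidence).
# Assumption: confidence is one of the words below.
CONFIDENCE_WORDS = {"certain", "uncertain"}

def split_response(raw: str):
    """Return (answer, confidence); a field is None when it is missing."""
    parts = [p.strip() for p in raw.strip().rstrip(".").split(",")]
    if len(parts) >= 2 and parts[-1].lower() in CONFIDENCE_WORDS:
        return ", ".join(parts[:-1]), parts[-1]   # e.g. "Beijing, Certain"
    if len(parts) == 1 and parts[0].lower() in CONFIDENCE_WORDS:
        return None, parts[0]                     # only a confidence was given
    return raw.strip(), None                      # only an answer was given

print(split_response("Beijing, Certain"))   # ('Beijing', 'Certain')
print(split_response("Beijing"))            # ('Beijing', None)
print(split_response("Uncertain"))          # (None, 'Uncertain')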

Post-process & Evaluate

  • Step 2: Get the indices of samples that do not match the expected output format.

    python collect.py --mode preprocess --source ./data/nq_sample.jsonl --input ./examples/test.jsonl --output ./examples/test_new.jsonl --confidence ./examples/confidence.jsonl --answer ./examples/answer.jsonl --model chatgpt
  • Step 3: Generate the missing results for these samples.

    python run_llm.py --source data/nq_sample.jsonl --ra none --type qa --outfile ./examples/post_answer.jsonl --idx ./examples/answer.jsonl --model chatgpt
    
    python run_llm.py --source ./examples/test.jsonl --ra none --type post --outfile ./examples/post_confidence.jsonl --idx ./examples/confidence.jsonl --model chatgpt
    • If a punish-style strategy (i.e., prior_punish or prior_pun_exp) was used during inference, set --type to post_punish.
  • Step 4: Merge the results and evaluate (a conceptual sketch follows below)

    python collect.py --mode evaluate --source ./data/nq_sample.jsonl --input ./examples/test.jsonl --output ./examples/test_new.jsonl --confidence ./examples/post_confidence.jsonl --answer ./examples/post_answer.jsonl
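
Conceptually, this step writes the re-generated answers/confidences back at the recorded indices and scores a prediction as correct if it contains a gold answer. The sketch below uses hypothetical field names (idx, response, answers) for illustration; the real schema is defined by collect.py:

import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def merge_and_evaluate(pred_path, patch_path, source_path):
    # Field names below are assumptions, not the repository's actual schema.
    preds = load_jsonl(pred_path)                # Step-1 responses
    for item in load_jsonl(patch_path):          # re-generated samples (Step 3)
        preds[item["idx"]]["response"] = item["response"]   # write back by index
    gold = load_jsonl(source_path)               # questions with gold answers
    correct = sum(
        any(ans.lower() in p["response"].lower() for ans in g["answers"])
        for p, g in zip(preds, gold)
    )
    return correct / len(preds)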

RAG

Static RAG

python run_llm.py --source data/nq_sample.jsonl --ra [sparse/dense/gold] --type qa --outfile ./examples/test_gold_static.jsonl --model chatgpt
  • For NQ, you can also specify --ra dpr to use the DPR documents as the gold documents (we do this in our paper).
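
For intuition, static RAG augments every question with retrieved passages, regardless of the model's confidence. A rough sketch of such a prompt follows, with an assumed template that does not necessarily match the one in run_llm.py:

# Illustrative only: build a retrieval-augmented QA prompt from top-k passages.
def build_rag_prompt(question: str, passages: list, k: int = 3) -> str:
    """Prepend the top-k retrieved passages to the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages[:k]))
    return (
        "Answer the question based on the given passages.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt("What is the capital of China?",
                       ["Beijing is the capital of China."]))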

Adaptive RAG

Adaptive RAG also takes four steps, closely mirroring the Inference pipeline above; the retrieval decision itself is sketched after the list below.

  • Step 1: Run run_llm.py to get the basic results

    • Specify --ra [sparse/dense/gold] (--ra dpr is also fine for NQ)
  • Step 2: Get the indices of samples that do not match the expected output format.

    • Same as Step 2 of Inference.
  • Step 3: Generate the missing results for these samples.

    • Same as Step 3 of Inference, but additionally specify --ra.
  • Step 4: To be continued
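
The idea behind adaptive RAG in the paper is to retrieve only for questions on which the model expresses uncertainty in Step 1, and to keep the closed-book answer otherwise. A minimal sketch of that decision, using hypothetical helper functions rather than the repository's actual API:

# Sketch only: trust confident closed-book answers, fall back to RAG otherwise.
# `answer_without_retrieval` and `answer_with_retrieval` are hypothetical helpers.
def adaptive_answer(question, answer_without_retrieval, answer_with_retrieval):
    answer, confidence = answer_without_retrieval(question)    # Step-1 result
    if confidence is not None and confidence.lower() == "certain":
        return answer                                           # keep the closed-book answer
    return answer_with_retrieval(question)                      # retrieve and re-answer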

Note

You can find the necessary commands in scripts/ and demo data in examples/.

The repository is continuously being updated.

Feel free to open an issue.
