Retrieval QA Benchmark (RQABench for short) is an open-source, end-to-end test workbench for Retrieval-Augmented Generation (RAG) systems. We intend to build an open benchmark that lets developers and researchers reproduce and design new RAG systems. We also want to create a platform where everyone can share their Lego blocks, helping others build up their own retrieval + LLM systems.

Here are some major features of this benchmark:
- Flexibility: We maximize flexibility when you design your retrieval system; any transform works as long as it accepts a `QARecord` as input and returns a `QARecord` as output. A minimal sketch of such a transform is given after this list.
- Reproducibility: We gather all settings in the evaluation process into a single YAML configuration, which helps you track and reproduce experiments.
- Traceability: We collect more than accuracy and scores. We also track the running time of any function you want to watch, as well as the tokens used in the whole RAG system.
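To make the flexibility point concrete, here is a minimal sketch of a transform with the required `QARecord -> QARecord` signature. The import path, the `question` field, and the `retrieve` helper are illustrative assumptions rather than the package's verified API; see the repository's transform base classes for the real interface.

```python
from typing import List

from retrieval_qa_benchmark.schema import QARecord  # assumed import path


def retrieve(query: str, k: int = 3) -> List[str]:
    """Placeholder retriever; swap in FAISS, MyScale, or any other backend."""
    return [f"(context {i} for: {query})" for i in range(k)]


def add_context(record: QARecord) -> QARecord:
    """Prepend retrieved context to the question and return a new record."""
    contexts = retrieve(record.question, k=3)  # `question` field is an assumption
    # QARecord is a pydantic model (`model_dump_json` is used below), so
    # model_copy returns an updated copy without mutating the input.
    return record.model_copy(
        update={"question": "\n".join(contexts) + "\n" + record.question}
    )
```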
```bash
# Clone to your local machine
git clone https://github.com/myscale/Retrieval-QA-Benchmark
# Install it as an editable package
cd Retrieval-QA-Benchmark && python3 -m pip install -e .
```
```python
from retrieval_qa_benchmark.models import *
from retrieval_qa_benchmark.datasets import *
from retrieval_qa_benchmark.transforms import *
from retrieval_qa_benchmark.evaluators import *
from retrieval_qa_benchmark.utils.profiler import PROFILER
# This is for loading our special YAML configuration with the `!include` keyword
from retrieval_qa_benchmark.utils.config import load
# This is where you can construct an evaluator from a config
from retrieval_qa_benchmark.utils.factory import EvaluatorFactory

# This will print all loaded modules. You can also use it as a reference to edit your configuration
print(str(REGISTRY))

# Choose a configuration to evaluate
config = load(open("config/mmlu.yaml"))
evaluator = EvaluatorFactory.from_config(config).build()

# The evaluator returns the accuracy as a float and a list of `QAPrediction`
acc, result = evaluator()

# You can set `out_file` to generate a JSONL file, or write it yourself:
with open("some-file-name-to-store-result.jsonl", "w") as f:
    f.write("\n".join([r.model_dump_json() for r in result]))
```
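Since each line of the JSONL file is one `QAPrediction` serialized with pydantic's `model_dump_json`, the results can be read back with just the standard library; a minimal sketch:

```python
import json

# Read the predictions back; each non-empty line is one JSON object.
with open("some-file-name-to-store-result.jsonl") as f:
    predictions = [json.loads(line) for line in f if line.strip()]

print(f"loaded {len(predictions)} predictions")
# Inspect one record to see the actual QAPrediction schema, e.g.:
print(sorted(predictions[0].keys()))
```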
- RAG with FAISS
  - Download the index file for Wikipedia here (around 26 GB).
  - Download the dataset from Hugging Face with our code (around 140 GB). It will be downloaded automatically the first time you run the benchmark.
  - Set the index path in the configuration to the downloaded index (see the sketch after this list).
- RAG with MyScale
  - Download the Wikipedia data in Parquet format here.
  - Insert the data and create a vector index. You can also directly use our free pod hosting the Wikipedia data, as described here.
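Where the index path lives in the YAML depends on your configuration; as a hedged sketch (assuming `load` returns a plain dict, and using a hypothetical key path that you should replace with the one in your own YAML), you can patch the loaded config before building the evaluator:

```python
from retrieval_qa_benchmark.utils.config import load
from retrieval_qa_benchmark.utils.factory import EvaluatorFactory

config = load(open("config/mmlu.yaml"))
# Hypothetical key path for illustration only; check your YAML for where
# the FAISS index path actually lives before relying on this.
config["evaluator"]["transform"]["args"]["index_path"] = "/path/to/wikipedia.index"
evaluator = EvaluatorFactory.from_config(config).build()
```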
| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo | ❌ | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| | ✅ (Top-1) | 75.66% (+3.95%) | 78.40% (+8.03%) | 46.00% (+8.00%) | 67.05% (-0.58%) | 73.21% (-1.51%) | 71.50% (+3.45%) |
| | ✅ (Top-3) | 76.97% (+5.26%) | 81.79% (+11.42%) | 48.00% (+10.00%) | 65.90% (-1.73%) | 73.96% (-0.76%) | 72.98% (+4.93%) |
| | ✅ (Top-5) | 78.29% (+6.58%) | 79.63% (+9.26%) | 42.00% (+4.00%) | 68.21% (+0.58%) | 74.34% (-0.38%) | 72.39% (+4.34%) |
| | ✅ (Top-10) | 78.29% (+6.58%) | 79.32% (+8.95%) | 44.00% (+6.00%) | 71.10% (+3.47%) | 75.47% (+0.75%) | 73.27% (+5.22%) |
| llama2-13b-chat-q6_0 | ❌ | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| | ✅ (Top-1) | 58.55% (+5.26%) | 61.73% (+4.32%) | 45.00% (+12.00%) | 46.24% (+1.73%) | 54.72% (+4.53%) | 55.13% (+4.83%) |
| | ✅ (Top-3) | 63.16% (+9.87%) | 63.27% (+5.86%) | 49.00% (+16.00%) | 46.82% (+2.31%) | 55.85% (+5.66%) | 57.10% (+6.80%) |
| | ✅ (Top-5) | 63.82% (+10.53%) | 65.43% (+8.02%) | 51.00% (+18.00%) | 51.45% (+6.94%) | 57.74% (+7.55%) | 59.37% (+9.07%) |
| | ✅ (Top-10) | 65.13% (+11.84%) | 66.67% (+9.26%) | 46.00% (+13.00%) | 49.71% (+5.20%) | 57.36% (+7.17%) | 59.07% (+8.77%) |

* The benchmark uses MyScale MSTG as the vector index.
* This benchmark can be reproduced with our GitHub repository retrieval-qa-benchmark.
| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo | ❌ | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| | ✅ (Top-1) | 75.00% (+3.29%) | 77.16% (+6.79%) | 44.00% (+6.00%) | 66.47% (-1.16%) | 73.58% (-1.14%) | 70.81% (+2.76%) |
| | ✅ (Top-3) | 75.66% (+3.95%) | 80.25% (+9.88%) | 44.00% (+6.00%) | 65.90% (-1.73%) | 73.21% (-1.51%) | 71.70% (+3.65%) |
| | ✅ (Top-5) | 78.29% (+6.58%) | 79.32% (+8.95%) | 46.00% (+8.00%) | 65.90% (-1.73%) | 73.58% (-1.14%) | 72.09% (+4.04%) |
| | ✅ (Top-10) | 78.29% (+6.58%) | 80.86% (+10.49%) | 49.00% (+11.00%) | 69.94% (+2.31%) | 75.85% (+1.13%) | 74.16% (+6.11%) |
| llama2-13b-chat-q6_0 | ❌ | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| | ✅ (Top-1) | 57.89% (+4.60%) | 61.42% (+4.01%) | 48.00% (+15.00%) | 45.66% (+1.15%) | 55.09% (+4.90%) | 55.22% (+4.92%) |
| | ✅ (Top-3) | 59.21% (+5.92%) | 65.74% (+8.33%) | 50.00% (+17.00%) | 50.29% (+5.78%) | 56.98% (+6.79%) | 58.28% (+7.98%) |
| | ✅ (Top-5) | 65.79% (+12.50%) | 64.51% (+7.10%) | 48.00% (+15.00%) | 50.29% (+5.78%) | 58.11% (+7.92%) | 58.97% (+8.67%) |
| | ✅ (Top-10) | 65.13% (+11.84%) | 66.05% (+8.64%) | 48.00% (+15.00%) | 47.40% (+2.89%) | 56.23% (+6.04%) | 58.38% (+8.08%) |

* The benchmark uses FAISS IVFSQ (nprobes=128) as the vector index.
* This benchmark can be reproduced with our GitHub repository retrieval-qa-benchmark.