ROLEBENCH - A Role Prompting Benchmark

ROLEBENCH is a framework for evaluating the performance of role prompting across different datasets and Large Language Models (LLMs).

  • Have a quick run 🏃 Open In Colab

Supported models

  • Llama3-8B Instruct
  • Phi-3 mini-4K Instruct
  • Mistral-7B Instruct
  • Gemma-7B Instruct
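A minimal sketch of loading one of these models and answering a role prompt with Hugging Face Transformers; the model ID and generation settings here are illustrative assumptions, not code taken from the notebooks.

```python
# Sketch only: load a supported chat model and answer a role prompt.
# MODEL_ID and generation settings are assumptions, not the notebook defaults.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # or the Phi-3 / Mistral / Gemma IDs

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # Wrap the role prompt in the model's chat template before generating.
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and return only the completion.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```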

Datasets

  • BoolQ (validation split, 3270 samples)
  • COMMONSENSEQA (validation split, 1221 samples)
  • iwslt2017-en-fr (validation split, 890 samples)
  • SamSum (test split, 819 samples)
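All four datasets are available on the Hugging Face Hub; a minimal loading sketch, assuming the Hub IDs below (the notebooks may fetch the data differently):

```python
# Sketch only: load the four evaluation splits via the `datasets` library.
# The Hub IDs are assumptions about where the data lives.
from datasets import load_dataset

boolq  = load_dataset("boolq", split="validation")                        # 3270 samples
csqa   = load_dataset("commonsense_qa", split="validation")               # 1221 samples
iwslt  = load_dataset("iwslt2017", "iwslt2017-en-fr", split="validation") # 890 samples
samsum = load_dataset("samsum", split="test")                             # 819 samples
```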

Prompt Template

BoolQ - Based on the passage:'{passage}'\nAnswer True/False to the question: '{question}' as an Omniscient person.

COMMONSENSEQA - Choose the answer as a critical thinker.\n{question}\n{opt1}. {text1}\n{opt2}. {text2}\n{opt3}. {text3}\n{opt4}. {text4}\n{opt5}. {text5}

IWSLT2017en-fr - Translate '{eng_text}' to french as a Translator.

SamSum - Summarise the Dialogue: {dialogue} as a Storyteller.
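For reference, a sketch of how these templates can be filled for a single sample. The field names follow the public Hugging Face schemas for these datasets and are assumptions about how the notebooks index them:

```python
# Sketch only: fill the role-prompt templates for one sample per dataset.
def boolq_prompt(sample):
    return (f"Based on the passage:'{sample['passage']}'\n"
            f"Answer True/False to the question: '{sample['question']}' as an Omniscient person.")

def csqa_prompt(sample):
    # commonsense_qa stores the five options as parallel label/text lists.
    options = "\n".join(
        f"{label}. {text}"
        for label, text in zip(sample["choices"]["label"], sample["choices"]["text"])
    )
    return f"Choose the answer as a critical thinker.\n{sample['question']}\n{options}"

def iwslt_prompt(sample):
    return f"Translate '{sample['translation']['en']}' to french as a Translator."

def samsum_prompt(sample):
    return f"Summarise the Dialogue: {sample['dialogue']} as a Storyteller."
```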

Results

| Model | BoolQ | COMMONSENSEQA | IWSLT2017 en-fr | SamSum |
|---|---|---|---|---|
| Llama3 | Accuracy = 0.8507, F1 = 0.8793 | Accuracy = 0.7371 | BLEU = 0.2399, METEOR = 0.5436 | ROUGE-1 = 0.1725, ROUGE-L = 0.1229 |
| Phi-3 | Accuracy = 0.8113, F1 = 0.8344 | Accuracy = 0.7068 | BLEU = 0.1928, METEOR = 0.4950 | ROUGE-1 = 0.1383, ROUGE-L = 0.0951 |
| Mistral-7B | Accuracy = 0.8281, F1 = 0.8548 | Accuracy = 0.6490 | BLEU = 0.1507, METEOR = 0.4763 | ROUGE-1 = 0.1359, ROUGE-L = 0.0991 |
| Gemma-7B | Accuracy = 0.6288, F1 = 0.5831 | Accuracy = 0.6288 | BLEU = 0.0940, METEOR = 0.3611 | ROUGE-1 = 0.1192, ROUGE-L = 0.0793 |
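These metrics can be computed with standard implementations; a sketch using the Hugging Face evaluate library (an assumption about tooling; the notebooks may compute them differently):

```python
# Sketch only: metric computation with `evaluate`; toy values, not real outputs.
import evaluate

rouge = evaluate.load("rouge")        # SamSum: ROUGE-1 / ROUGE-L
bleu = evaluate.load("bleu")          # IWSLT2017 en-fr
meteor = evaluate.load("meteor")      # IWSLT2017 en-fr
accuracy = evaluate.load("accuracy")  # BoolQ, COMMONSENSEQA
f1 = evaluate.load("f1")              # BoolQ (True/False mapped to 1/0)

preds = ["Bob and Alice met for coffee."]  # illustrative model output
refs = ["Bob met Alice for coffee."]       # illustrative reference

print(rouge.compute(predictions=preds, references=refs)["rougeL"])
print(bleu.compute(predictions=preds, references=[refs])["bleu"])
print(meteor.compute(predictions=preds, references=refs)["meteor"])
```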

Repository Structure

ROLEBENCH
|_ llama3_role_all.ipynb   -- Role prompting on all datasets using the Llama3-8B Instruct model
|_ phi3_role_all.ipynb     -- Role prompting on all datasets using the Phi-3 mini-4K Instruct model
|_ mistral_role_all.ipynb  -- Role prompting on all datasets using the Mistral-7B Instruct model
|_ Gemma_role_all.ipynb    -- Role prompting on all datasets using the Gemma-7B Instruct model
|_ Role_prompting
   |_ quantitaive_analysis.txt
   |_ qualitative_analysis.txt

Contribution

The project will always remain open source. Contributions adding new models or datasets, or formulating new roles for the prompt templates, are always welcome.

References

If you find this work useful, please cite this repository:

@software{Budagam_ROLEBENCH-_A_Role_2024,
  author = {Budagam, Devichand},
  month = may,
  title = {{ROLEBENCH- A Role Prompting Benchmark}},
  url = {https://github.com/devichand579/ROLEBENCH},
  year = {2024}
}

License

This project is released under the MIT License.