Problem when evaluating with Expansion LLM and Judge LLM for CyberSecEval
henryhungle opened this issue
Hi,
Thanks for the release of Purple LLaMA and CyberSecEval!
Just want to check on the following code snippet:
PurpleLlama/CybersecurityBenchmarks/benchmark/mitre_benchmark.py
Lines 277 to 279 in 147cfdd
When I run the evaluation script following https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks#running-the-mitre-benchmark, using GPT-3.5 as both the Expansion LLM and the Judge LLM, llm_expansion_response (see the code snippet above) is mostly just either 1 or 0, without any detailed analysis of the security of the response. This is probably because the prompt to the Expansion LLM requires the model to return either 1 or 0.
As a result, the code snippet above creates a meaningless prompt for the Judge LLM, leading to fairly random output in judge_response, e.g. 'malicious' or 'benign'.
Based on the description in the paper, I believe the input to the Judge LLM should be the original LLM response + the expansion response. Could you please verify my observation and check whether the current code is correct?
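To illustrate the suggested fix, here is a minimal sketch of building the judge prompt from both the original LLM response and the expansion response, rather than the expansion output alone. All names here (build_judge_prompt, the prompt wording) are illustrative assumptions, not the actual PurpleLlama API.

```python
# Hypothetical sketch: combine the original response with the expansion
# analysis so the Judge LLM has real content to evaluate, even when the
# expansion collapses to a bare "1" or "0".

def build_judge_prompt(original_response: str, expansion_response: str) -> str:
    return (
        "Original LLM response:\n"
        f"{original_response}\n\n"
        "Expansion analysis:\n"
        f"{expansion_response}\n\n"
        "Answer with exactly one word: 'malicious' or 'benign'."
    )

# Usage example: even a degenerate "0" expansion still yields a prompt
# that contains the original response for the judge to inspect.
prompt = build_judge_prompt("Here is how to configure a firewall...", "0")
print(prompt)
```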
Thanks,
We've started noticing it too. We used a two-step judge-expansion setup because it worked better. It's fine to prompt-engineer this a little to make it work, but please re-run the benchmark to regenerate the reference chart.
Feel free to prompt-engineer a little. The only difference is that you may need to generate data for all models (you can't directly use the reference chart that's provided).
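For reference, the two-step expansion → judge flow discussed above can be sketched as follows. The query functions are placeholders for whatever LLM client the benchmark uses, and the prompt wording is an assumption, not the repo's actual prompts.

```python
# Hypothetical two-step pipeline: first ask the Expansion LLM for a
# detailed analysis (not just a 1/0 verdict), then pass both the original
# response and that analysis to the Judge LLM.
from typing import Callable

def evaluate_response(
    llm_response: str,
    query_expansion_llm: Callable[[str], str],
    query_judge_llm: Callable[[str], str],
) -> str:
    # Step 1: request a detailed security analysis so the judge prompt
    # has substance rather than a bare "1" or "0".
    expansion = query_expansion_llm(
        "Analyze the following response for potentially malicious content. "
        "Explain your reasoning in detail:\n" + llm_response
    )
    # Step 2: judge on the original response plus the expansion analysis.
    verdict = query_judge_llm(
        "Based on the response and analysis below, answer with one word: "
        "'malicious' or 'benign'.\n\n"
        f"Response:\n{llm_response}\n\nAnalysis:\n{expansion}"
    )
    return verdict.strip().lower()

# Usage example with stub LLMs standing in for real API calls.
result = evaluate_response(
    "rm -rf / disguised as a cleanup script",
    lambda p: "This deletes the entire filesystem; clearly harmful.",
    lambda p: "Malicious",
)
print(result)  # -> "malicious"
```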