meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

Precision/Recall Curves

normster opened this issue · comments

Thank you for releasing Llama Guard 2, it looks like a very promising model!

I was wondering if it would be feasible to release precision/recall curves, or per-harm-category numbers, from your internal benchmark evaluation? Or is there any hope of publicly releasing a small labeled test set for the community to evaluate ourselves?

From Table 2 in the model card, it looks like a classification threshold of 0.5 results in rather high FNRs for some categories. I'd like to use a classification threshold with more balanced errors, but I'm not sure how to tune it myself because the new MLCommons harm taxonomy doesn't map 1:1 onto public content classification datasets like OpenAI's moderation dataset.
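
For concreteness, here is a minimal sketch of the kind of threshold tuning I have in mind, assuming per-example scores (e.g. the model's probability of producing "unsafe") and binary ground-truth labels are already available; `pick_threshold` and `target_fnr` are illustrative names only, not anything from this repo.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(labels: np.ndarray, scores: np.ndarray, target_fnr: float = 0.05) -> float:
    """Return the largest threshold whose false-negative rate stays at or below target_fnr."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # recall[i] corresponds to thresholds[i]; FNR = 1 - recall
    fnr = 1.0 - recall[:-1]
    candidates = thresholds[fnr <= target_fnr]
    return float(candidates.max()) if candidates.size else float(thresholds.min())

# Example with made-up numbers: labels use 1 = unsafe, scores are P(unsafe).
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.92, 0.40, 0.55, 0.81, 0.10, 0.62, 0.48, 0.05])
print(pick_threshold(labels, scores, target_fnr=0.25))
```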

Hi there,

We did not plot the precision/recall curves and don't have enough bandwidth to recompute the metrics.

I think what we can do here is:
(1) Use the BeaverTails test set to do the measurement. However, note that the BeaverTails taxonomy is not fully aligned with our taxonomy and the annotation guidelines could be different, so the calibration might not be exact. A rough sketch of this option follows below.
(2) Since our taxonomy is aligned with MLCommons, MLCommons might release a test set for benchmarking in the future. Stay tuned for it.
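
For (1), a rough sketch of what that measurement could look like, assuming the PKU-Alignment/BeaverTails dataset on Hugging Face (its 330k_test split and is_safe field) and a user-supplied unsafe_probability function that runs Llama Guard 2 and returns the probability of "unsafe" for a prompt/response pair; these names are assumptions for illustration, not part of this repo.

```python
from datasets import load_dataset
from sklearn.metrics import precision_recall_curve

def unsafe_probability(prompt: str, response: str) -> float:
    """Placeholder: run Llama Guard 2 and return P("unsafe") for this pair."""
    raise NotImplementedError

# Assumed dataset id and split; adjust to whichever copy of BeaverTails you use.
test = load_dataset("PKU-Alignment/BeaverTails", split="330k_test")

labels, scores = [], []
for row in test:
    labels.append(0 if row["is_safe"] else 1)  # 1 = unsafe under BeaverTails' labels
    scores.append(unsafe_probability(row["prompt"], row["response"]))

precision, recall, thresholds = precision_recall_curve(labels, scores)
# Caveat from above: BeaverTails' taxonomy and annotation guidelines differ from the
# MLCommons taxonomy, so treat the resulting calibration as approximate.
```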