promptfoo / promptfoo

Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Home Page: https://www.promptfoo.dev/

Docs request: where does the histogram come from?

jamesbraza opened this issue

I have a Python assertion with three possible scores (0, 0.1, and 1), plus two basic assertions.

```yaml
providers:
  - openai:chat:gpt-4-0613
  - openai:chat:gpt-4-turbo-2024-04-09
  - anthropic:messages:claude-3-sonnet-20240229
defaultTest:
  assert:
    - description: was answered
      type: not-icontains
      value: cannot answer
    - description: has sentences
      type: javascript
      value: output.length > 20
    - description: check value
      type: python
      value: file://assert.py
```
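
(For context: `assert.py` itself isn't shown in this issue. Below is a minimal sketch of what such a file might look like, assuming promptfoo's `get_assert(output, context)` entry point, where a returned float is used as the assertion's score. The specific checks are hypothetical.)

```python
# assert.py -- hypothetical sketch; the real file isn't shown in the issue.
# promptfoo calls get_assert(output, context) for python assertions, and a
# float return value is used directly as the assertion's score.

def get_assert(output: str, context) -> float:
    """Return one of the three possible scores: 0, 0.1, or 1."""
    if "expected keyword" not in output:  # hypothetical hard-failure check
        return 0.0
    if len(output) < 100:                 # hypothetical partial-credit check
        return 0.1
    return 1.0
```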

At the top of my promptfoo view, I see bins around 0.6 and 0.7, which doesn't quite make sense to me:

[screenshot of histogram]

The request is: can a short description be added so that this figure is easy to understand?

  • I have three different model providers; is that where Prompt 1 (red), Prompt 2 (blue), and Prompt 3 (green) come from?
  • Why does the histogram show scores of 0.6 and 0.7? Is that like a sum of multiple assertions' scores?

I now understand that I have three assertions:

  • Two binary ones: each can score 0 or 1
  • One custom assertion: can score 0, 0.1, or 1

I realized the histogram plots the mean score per test: 0.7 = (1 + 1 + 0.1) / 3. See the sketch below.
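
In other words, each test's score is the mean of its assertion scores, equally weighted by default. A minimal sketch of that arithmetic, using the assertion names from the config above (promptfoo also supports a per-assertion `weight`, which would turn this into a weighted average):

```python
# Sketch of how a test score of 0.7 arises from the three assertions
# above (default weight of 1 per assertion assumed).
assertion_scores = {
    "was answered": 1.0,   # not-icontains passed
    "has sentences": 1.0,  # javascript assertion passed
    "check value": 0.1,    # partial credit from assert.py
}

mean_score = sum(assertion_scores.values()) / len(assertion_scores)
print(f"{mean_score:.1f}")  # 0.7
```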

That being said, I still think promptfoo could add a little info bubble or hover-over tooltip that explains this.

Feel free to close this out if uninterested.