promptfoo / promptfoo

Test your prompts, agents, and RAGs. Redteaming, pentesting, vulnerability scanning for LLMs. Improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Home Page: https://www.promptfoo.dev/

llm-rubric with VertexAI Gemini Pro

romaintoub opened this issue · comments

Hey @typpo

I've made it work with the Python assertion, but it only works about 80% of the time: since it relies on stdout from the main function, it is very sensitive to any errors or bugs. So I want to go back to the first option that was working initially, and I think I have some good info on why it's not working as expected. My config:

# The LLM output will be graded by gemini-pro
defaultTest:
  options:
    provider: vertex:gemini-pro
    config:
      temperature: 0
      maxOutputTokens: 1024
    rubricPrompt:
      - role: system
        content: >-
          You are evaluating the answer...
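
For reference, the kind of test that exercises this grader looks roughly like the following (the rubric wording here is just illustrative, not my real suite):

tests:
  - vars:
      question: Why the sky is blue?
    assert:
      - type: llm-rubric
        value: >-
          The answer explains that blue light is scattered more strongly than
          the other colors, which is why the sky looks blue.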

When I ran it in debug mode, I found this:

Cache is disabled.
Computing file hash for script evaluator/providers/eval_llm_chain.py
Running python script evaluator/providers/eval_llm_chain.py with scriptPath providers/eval_llm_chain.py and args: Why the sky is blue?
[object Object]
[object Object]
Python script evaluator/providers/eval_llm_chain.py returned: Importing module eval_llm_chain from evaluator/providers ...
{"type": "final_result", "data": {"output": "I can't answer that question."}}
Coerced JSON prompt to Gemini format: [{"role":"system","content":"You are evaluating the answer of an assistant.\n    ## Your turn!\n  Target: Sunlight reaches Earth's atmosphere and is scattered in all directions by all the gases and particles in the air. Blue light is scattered more than the other colors because it travels as shorter, smaller waves. This is why we see a blue sky most of the time.\n  Assistant: I can't answer that question."},{"role":"user","content":"Output: I can't answer that question."}]
Preparing to call Google Vertex API (Gemini) with body: {"contents":{"role":"user","parts":{"text":"You are evaluating the answer of an assistant.\n  same thing "}},"generationConfig":{}}
Gemini API response:
    [
        {"candidates":[{"content":{"role":"model","parts":[{"text":"{"}]}}]},
        {"candidates":[{"content":{"role":"model","parts":[{"text":" \"pass\": false, \"score\": 0.0, \"reason\": \"The target response should not talk about off-topic discussions"}]},"safetyRatings":[{"category":"HARM_CATEGORY_HATE_SPEECH","probability":"NEGLIGIBLE","probabilityScore":0.24095316,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.14977993},{"category":"HARM_CATEGORY_DANGEROUS_CONTENT","probability":"NEGLIGIBLE","probabilityScore":0.091220066,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.13139598},{"category":"HARM_CATEGORY_HARASSMENT","probability":"NEGLIGIBLE","probabilityScore":0.25870034,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.15140383},{"category":"HARM_CATEGORY_SEXUALLY_EXPLICIT","probability":"NEGLIGIBLE","probabilityScore":0.091220066,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.07544843}]}]},
        {"candidates":[{"content":{"role":"model","parts":[{"text":".\"} similar answer to the target response.\"\n}"}]},"finishReason":"STOP","safetyRatings":[{"category":"HARM_CATEGORY_HATE_SPEECH","probability":"NEGLIGIBLE","probabilityScore":0.20291664,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.12656909},{"category":"HARM_CATEGORY_DANGEROUS_CONTENT","probability":"NEGLIGIBLE","probabilityScore":0.058024753,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.13523208},{"category":"HARM_CATEGORY_HARASSMENT","probability":"NEGLIGIBLE","probabilityScore":0.24220563,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.16013464},{"category":"HARM_CATEGORY_SEXUALLY_EXPLICIT","probability":"NEGLIGIBLE","probabilityScore":0.05623635,"severity":"HARM_SEVERITY_NEGLIGIBLE","severityScore":0.06816437}]}],
        "usageMetadata":{"promptTokenCount":482,"candidatesTokenCount":40,"totalTokenCount":522}}
]
Eval #1 complete (1 of 1)

And the error I see in the prompt view is:

llm-rubric produced malformed response: { "pass": false, "score": 0.0, "reason": "The target response should not talk about off-topic discussions."} similar answer to the target response."}

It looks like the response from Gemini is broken into 3 different parts, and the last part is not necessary; it's what causes the parsing issue, since it adds extra text after the dictionary.
Do you know if this could come from the VertexAI API and be fixed here?
Also, should the Gemini config be integrated into the generationConfig attribute in the body? Normally I use gemini-1.0-pro with temperature = 0.

Just pushed two changes that should help address this problem.

  1. Try changing your config to:

    config:
      generationConfig:
        temperature: 0
    

    Google nests temperature and similar settings under the generationConfig key. I may make a change to automatically correct this (because I'm sure you're not the only one), but as a temporary workaround, try adding generationConfig. A fuller example is below, after item 2.

  2. Updated llm-rubric to be more resilient to JSON responses with additional text before or after.
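
Applied to the config at the top of this issue, that would look roughly like this (a sketch; moving maxOutputTokens under generationConfig follows the same pattern as temperature, though I haven't verified that field specifically):

defaultTest:
  options:
    provider: vertex:gemini-pro
    config:
      generationConfig:
        temperature: 0
        maxOutputTokens: 1024
    rubricPrompt:
      - role: system
        content: >-
          You are evaluating the answer...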

thanks! I will test it out

it works now, thanks a lot!