guardrails-ai / guardrails

Adding guardrails to large language models.

Home Page: https://www.guardrailsai.com/docs


Use guardrails for code validation

yje-arch opened this issue · comments

commented

Hi,

I am the maintainer of YiVal (https://github.com/YiVal/YiVal). We are currently trying to use guardrails to help us generate valid Python code.

To get started, I did a quick evaluation; here is the code:
https://github.com/YiVal/YiVal/blob/master/demo/guardrails/run_leetcode.py

I downloaded 80 LeetCode questions and asked gpt-3.5-turbo to generate Python code for each; pass/fail is basically judged by whether the generated code can be evaluated and run. I just followed the Colab here:
https://github.com/ShreyaR/guardrails/blob/main/docs/examples/bug_free_python_code.ipynb
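
A minimal sketch of this kind of pass/fail check (purely illustrative; the actual evaluation script lives in the YiVal repo linked above):

import ast

def passes(code: str) -> bool:
    # A generated snippet passes if it both parses and executes without raising.
    try:
        ast.parse(code)
        exec(code, {"__name__": "__main__"})
        return True
    except Exception:
        return False

print(passes("print('hello')"))          # True
print(passes("def broken(:\n    pass"))  # False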

I used the same prompt without guardrails as a comparison. Here are the results:
[Screenshot, 2023-09-17: comparison of pass/fail rate and token usage with and without guardrails]

As you can see, with guardrails the failure rate is higher and we use more tokens compared to just calling the raw GPT API.

I am wondering if there is anything wrong with what I am doing here. This could also help others who might run into the same issue.

Thanks in advance for taking a look!

Hi,

Thanks for the detailed issue. The additional tokens may be coming from the gr.json prompt key, which adds a decent amount of weight to the prompt. Do you have a comparison in which that is not used? I think that would be good data to collect. Another thing you could do to reduce the token count is to use string-style validation instead of pydantic/structured validation. You can apply the BugFreePythonCode validator to a guard like this: https://docs.guardrailsai.com/defining_guards/strings/. With a little bit of prompt engineering, I think you can slim down your token count considerably using this approach. Would love to see results/help with this longer term! Feel free to post here or work with us on Discord!
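
Roughly, a string-style guard along those lines could look like the sketch below (illustrative only: the prompt wording, the ${...} variable syntax, and the BugFreePython import are assumptions based on the 0.2.x-era snippets later in this thread):

import guardrails as gd
import openai
from guardrails.validators import BugFreePython

# Keep the prompt slim; no JSON schema suffix is appended in string mode.
prompt = """
Write a short Python snippet that solves the following problem.
Return only Python code, with no surrounding text.

Problem:
${leetcode_problem}
"""

guard = gd.Guard.from_string(
    validators=[BugFreePython(on_fail="reask")],
    description="Python code solving a LeetCode problem",
    prompt=prompt,
)

leetcode_problem = "Given a string s, find the length of the longest substring without repeating characters."

raw_llm_response, validated_response = guard(
    llm_api=openai.ChatCompletion.create,
    prompt_params={"leetcode_problem": leetcode_problem},
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0,
    num_reasks=3,
)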

commented

Hi,
Thank you for your response!

I understand the point about token usage; retries naturally lead to more tokens being used.

However, I am particularly concerned about the quality of the output. As you can see from the attached image above, the simple code used to generate the Python code:

response = await openai.ChatCompletion.acreate(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0,
    max_tokens=1000
)

appears to perform better than the code using guardrails:

guard = gd.Guard.from_pydantic(
    output_class=BugFreePythonCode, prompt=prompt_guardrail
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    prompt_params={"leetcode_problem": leetcode_problem},
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0,
    num_reasks=3,
)

The accuracy achieved using the plain OpenAI API is 0.625, while it is 0.55 when using guardrails (the prompt is pretty much the same). This difference is significant and seems to contradict the purpose of using guardrails. Could you please help me understand if there is something I am missing, or if there are any recommendations you could provide to improve the accuracy with guardrails?

Thank you for your time and assistance.

LLMs are better at generating code in a block than in a JSON field. Try using Guard.from_string, targeting a string output type, instead of Guard.from_pydantic, which generates dicts.

commented

Thanks, I tried to use the following

guard = gd.Guard.from_string(
    validators=[BugFreePython(on_fail="reask")],
    prompt=prompt,
    description="leetcode problem",
)
raw_llm_response, validated_response = await guard(
    llm_api=openai.ChatCompletion.acreate,
    num_reasks=3,
)

And the validated_response is not executable, since it is returned as a string; the underlying BugFreePython validator uses ast.parse, which passes on the string regardless, but the string cannot be executed by exec().

An example response would be:

Validated Output:

    'Sure! Here\'s a Python code snippet that solves the problem:\n\n```python\ndef length_of_longest_substring(s):\n    # Create a dictionary to store the characters and their indices\n    char_map = {}\n    # Initialize variables to keep track of the starting index and the longest substring length\n    start = 0\n    max_length = 0\n    \n    # Iterate through the string\n    for i in range(len(s)):\n        # Check if the current character is already in the dictionary and its index is greater than or equal to the start index\n        if s in char_map and char_map[s] >= start:\n            # Update the start index to the next character after the repeated character\n            start = char_map[s] + 1\n        # Update the dictionary with the current character and its index\n        char_map[s] = i\n        # Update the max length if the current substring length is greater\n        max_length = max(max_length, i - start + 1)\n    \n    return max_length\n\n# Test the function with the given examples\ns1 = "abcabcbb"\nprint(length_of_longest_substring(s1))  # Output: 3\n\ns2 = "bbbbb"\nprint(length_of_longest_substring(s2))  # Output: 1\n\ns3 = "pwwkew"\nprint(length_of_longest_substring(s3))  # Output: 3\n```\n\nThis code snippet uses a sliding window approach to find the length of the longest substring without repeating characters. It keeps track of the starting index of the current substring and updates it whenever a repeated character is found. The function returns the maximum length encountered during the iteration.'

Some more prompt engineering might help, like asking in the prompt to only return a Python code block without any surrounding text.
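
For instance, something along these lines (illustrative only; the fence-stripping helper is a local post-processing step, not a guardrails API):

import re

# Appended to the generation prompt (wording is illustrative).
PROMPT_SUFFIX = (
    "Return only a single Python code block and nothing else: "
    "no explanation, no surrounding text."
)

def extract_code(response: str) -> str:
    # If the model still wraps its answer in a markdown fence, strip it.
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

sample = "Sure! Here's the code:\n```python\nprint('hi')\n```"
print(extract_code(sample))  # prints: print('hi')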

commented

Thanks, does guardrails support this natively?

This is the example we've got, though it generates the code as part of a JSON. https://docs.guardrailsai.com/examples/bug_free_python_code/#step-3-wrap-the-llm-api-call-with-guard

Have you tried using the prompt in that example, generating a string instead of a JSON?