The LLM (Large Language Model) Coding Ability Benchmark Suite is designed to evaluate the problem-solving and code generation capabilities of AI models, particularly in the domain of programming challenges. This suite comprises three main phases: Problem Definition, Solution Generation, and Grading. Each phase is structured to facilitate a robust evaluation of AI models, ensuring that the generated solutions not only solve the problems as required but also adhere to specified criteria and standards.
The benchmark suite takes problem sets in a defined structured format, provides each problem to an LLM, asks for code that solves the problem, then evaluates the solution on a variety of different factors. Separating these phases allows for a standardized approach to evaluating different AI models on a common ground of programming challenges.
Get started by running the benchmark.py
script. Here is an example using built-in problem sets, AIs, and graders:
OPENAI_API_KEY=<key> python benchmark.py --validate --generate --grade --model gpt-4 --grader performance correctness
This command:
- Validates the JSON for each problem set.
- Asks the OpenAI model
gpt-4
to generate a solution for each problem. - Grades the resulting solutions on both correctness and performance.
- Outputs individual scores and an overall grade to
stdout
and writes a report with this data in JSON format.
Example output:
GradingOutput (performance):
Overall Score: 0.9558801079752495
Solutions Count: 10
GradingOutput (correctness):
Overall Score: 0.9
Solutions Count: 10
Want to get started migrating your benchmark to the framework? Take a look at the migration guide.
The benchmarking suite contains sets of problems that the AI models will attempt to solve. Problems are defined in a structured JSON format, which encapsulates various details of a problem including a unique identifier, description, function prototype, and correctness test suite among others. Each problem can also have multiple prompts to guide the AI models in generating solutions.
For more details on the problem definition JSON format, see the full specification. Most fields are optional, but using as many fields as possible will make it easier to grade the solutions in a consistent manner.
Here is an example of a basic problem definition:
{
"identifier": "sum_two_integers",
"prompts": [
{
"prompt_id": "simple_prompt",
"prompt": "Write a function named 'add' that takes two integer arguments, 'a' and 'b', and returns their sum."
}
]
}
Each problem definition goes in a separate file in the problems
directory for the problem set.
Here is an example of a more complex problem description that asks an LLM to correct buggy code. It defines input code and test cases to be provided to the LLM, an optimal solution, and multiple prompts.
{
"identifier": "fix_multiplication",
"prompts": [
{
"prompt_id": "detailed_prompt",
"prompt": "The provided function 'multiply' attempts to return the product of two numbers, but it contains bugs. Debug and fix the function to return the correct product.",
"genericize": true,
"sample_inputs_outputs": [
{
"input": {"a": 3, "b": 4},
"expected_output": [12]
},
{
"input": {"a": -2, "b": 7},
"expected_output": [-14]
}
],
"input_code": "def multiply(a: int, b: int) -> int:\n return a + b # bug: incorrect operation"
},
{
"prompt_id": "brief_prompt",
"prompt": "Debug and fix the function 'multiply' to correctly return the product of two numbers.",
"genericize": false,
"sample_inputs_outputs": [
{
"input": {"a": 5, "b": 6},
"expected_output": [30]
},
{
"input": {"a": -1, "b": -1},
"expected_output": [1]
}
],
"input_code": "def multiply(a: int, b: int) -> int:\n return a / b # bug: incorrect operation"
}
],
"function_prototype": {
"function_name": "multiply",
"parameters": [
{
"name": "a",
"type": "int"
},
{
"name": "b",
"type": "int"
}
],
"return_values": [
{
"type": "int"
}
]
},
"correctness_test_suite": [
{
"input": {"a": 5, "b": 6},
"expected_output": [30]
},
{
"input": {"a": -1, "b": -1},
"expected_output": [1]
},
{
"input": {"a": 0, "b": 0},
"expected_output": [0]
},
{
"input": {"a": 10, "b": 10},
"expected_output": [100]
},
{
"input": {"a": -5, "b": 7},
"expected_output": [-35]
}
],
"optimal_solution": "def multiply(a: int, b: int) -> int:\n return a * b",
"tags": [
"Debugging",
"Arithmetic",
"Easy"
]
}
In the Solution Generation phase, AI models generate solutions for the defined problems.
The problem definition is converted to a prompt like the following:
The provided function `add` attempts to return the sum of two integers, but it contains bugs. Debug and fix the function to return the correct sum.
Function Signature:
add(a: int, b: int) -> int
Sample Inputs and Outputs:
Test Case 1:
Input: a = 5, b = 6
Expected Output: 11
Test Case 2:
Input: a = 8, b = 3
Expected Output: 11
Input Code:
def add(a: int, b: int) -> int:
return a - b
The test suite has two built-in queriers:
OpenAIModelQuerier
, which uses the OpenAI API to interact with any model supported by the APIHumanAIModelQuerier
, which provides prompts at the command line to be copied and pasted into LLMs.
The test suite will determine which querier to use based on the model name passed in. It first checks it to see if the OpenAI API handles it; if not, it falls back to the human querier. Running python benchmark.py -h
will show the supported OpenAI models.
Adding new queriers is straightforward. Simply extend the abstract base class AIModelQuerier
and implement the generate_solution
method to provide logic for generating solutions. The LLMProblemInput
class is used to encapsulate the input data for the AI models, while the LLMSolution
class is used to encapsulate the generated solutions.
Example:
from llm_types import AIModelQuerier, LLMProblemInput, LLMSolution
class MyQuerier(AIModelQuerier):
def __init__(self, model_identifier: str):
super().__init__(model_identifier)
@classmethod
def supported_model_names(cls):
# ... (determine which model names are supported by this querier)
pass
def generate_solution(self, problem_input: LLMProblemInput) -> 'LLMSolution':
# ... (implementation of generate_solution)
pass
my_querier = MyQuerier(model_identifier="claude")
solution = my_querier.generate_solution(problem_input)
After the querier returns solutions for the provided problems, the resulting LLMSolution
has the following serialized format. For more details on the JSON format, see the full specification.
{
"problem_identifier": "problem_1",
"model_identifier": "text-davinci-002",
"prompt_identifier": "brief_prompt",
"solution_code": "\"\"\"\n\ndef add(a, b):\n # fill in your code here\n return a + b",
"feedback": null
}
The Grading phase evaluates the generated solutions against the defined problems and grading criteria. The SolutionGrade
class captures the grading details for each solution, and the GradingOutput
class encapsulates the overall grading output. A Grader
abstract base class provides the interface that subclasses implement for grading the solutions, which can be customized to suit different grading criteria and requirements.
The test suite includes correctness
and performance
graders; new graders can be added by creating new subclasses of Grader
.
Example:
from llm_types import Grader, SolutionGrade, GradingOutput
class MyGrader(Grader):
def grade(self, problem_definitions, solutions) -> GradingOutput:
# ... (implementation of grading logic)
pass
my_grader = MyGrader()
grading_output = my_grader.grade(problem_definitions, solutions)
Here is an example grading JSON output. For more details on the JSON format, see the full specification.
{
"problem_identifier": "problem_1",
"prompt_identifier": "detailed_prompt",
"model_identifier": "text-davinci-002",
"score": 0.8,
"sub_criteria_scores": null,
"issues": [
"Test failed: Input: a = 1, b = 2\nExpected Output: 3 Result: {'exception': 'EOF while scanning triple-quoted string literal (<string>, line 5)', 'traceback': 'Traceback (most recent call last):\\n File \"/Users/morgang/code/LLMCodingBenchmarkingFramework/execution.py\", line 27, in executor_script\\n exec(function_code, exec_globals)\\n File \"<string>\", line 5\\n \"\"\"\\n\\ndef add(a, b):\\n # fill in your code here\\n return a + b\\n \\n\\n \\n \\n ^\\nSyntaxError: EOF while scanning triple-quoted string literal\\n', 'parameters': [1, 2], 'function_code': '\"\"\"\\n\\ndef add(a, b):\\n # fill in your code here\\n return a + b'}"
]
Each time the grading system invoked, a report will be generated and written to the reports
folder within the root directory by default. To change the location of the report storage, use the --report_path
argument. For each model, a report is generated, containing scores from all test cases graded during that particular grading process. Each report also contains average scores for each problem set (for example, "basic" and "bugfixing" represent two sample problem sets currently in this repo) and average scores for each grading criterion (for example, "correctness" or "performance"). The reports are distinguished by timestamp and name of model.
See our full (migration guide)[migration_guide.md] for details on how to migrate existing problem sets and benchmarks to this framework.
The framework separates different phases, models, and graders into distinct directories and files.
The directory structure of the Benchmarking Framework is as follows:
- problem_sets:
- <problem_set_name>:
- problems:
- Contains JSON files for each problem (e.g.,
problem_1.json
).
- Contains JSON files for each problem (e.g.,
- solutions:
- Contains JSON files for solutions to each problem, named the same as the problems (e.g.,
problem_1.json
). - <model_name>:
- Each problem has its own directory (e.g.,
problem_1
), within which there is a JSON file for each prompt with the model's solution for that prompt.
- Each problem has its own directory (e.g.,
- Contains JSON files for solutions to each problem, named the same as the problems (e.g.,
- grades:
- text-davinci-002:
- <grader:
- Each problem has its own directory (e.g.,
problem_1
), within which there is a JSON file for each prompt with the grading information for that prompt.
- Each problem has its own directory (e.g.,
- <grader:
- text-davinci-002:
- problems:
- bugfixing:
- Similar structure to the
basic
directory but tailored for bugfixing problems.
- Similar structure to the
- <problem_set_name>:
- reports: Contains a JSON report for each run of the testing framework in the format described above.
Some additional notes:
- Each problem set (e.g.,
basic
,bugfixing
) has a similar structure containing directories for grades, problems, and solutions. - The
problems
directory holds the problem definition files in JSON format for each problem within the problem set. - The
solutions
directory contains sub-directories for different models, with each problem having its own folder containingbrief_prompt.json
anddetailed_prompt.json
files, likely representing different levels of solution details. - Within the
grades
directory, different AI models (e.g.,text-davinci-002
) have their own sub-directories, further subdivided by grader (e.g.correctness
andperformance
). Each problem within these sub-directories has a dedicated folder containing a JSON file for each prompt that holds grading details.
This structured setup facilitates the organization of problems, solutions, and grading details, making it easier to manage and navigate the benchmarking framework.