symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Home Page: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Evaluation task: Code repair

ruiAzevedo19 opened this issue · comments

Goal

Given source code with compilation errors, the model needs to repair the code so that it compiles. The response is validated by executing predefined tests, which makes sure that the behavior of the implementation itself is not altered.

PRs

Follow-up

TODOs

  • Testdata
    • Examples
      • function opening brackets are missing
      • type is missing
      • type is wrong
      • import is missing
      • variable is not declared
    • For each case:
      • generate test with symflower unit-tests
      • check the tests are passing
      • add a mistake to the implementation
      • commit
  • Implementation
    • Define a new task identifier: code-repair
    • For the symflower model, define this task as unsupported because we always generate deterministic tests
    • For LLM models
      • Define the new task as supported
      • Create an interface for tasks
        • Interface: Task
        • Methods
          • Run(repository) (assessment, err): run the task for the given repository and return the assessments
          • Identifier: returns the task identifier
      • Define tasks
        • TaskWriteTests
          • The Run method is basically what we already have in evaluate/repository.go:Evaluate
          • Remove evaluate/repository.go:Evaluate since it is now part of the task
        • TaskCodeRepair
          • The Run method is responsible for running the task only for source code files (test files and other files are filtered out)
            • The method must range over the sub-directories in the mistakes testdata and run the code repair task for each sub-directory
            • Add two methods to the language interface
              • DefaultFileExtension returns the language file extension
              • DefaultTestFileSuffix returns the language test file suffix, e.g., _test.go for Go and Test.java for Java
              • Note: this will be used to easily filter out files
      • Calling the Run method
        • replace the call temporaryRepository.Evaluate(...) in evaluate/evaluate.go:Evaluate with the task Run method
          • We are ranging over temporaryRepository.Tasks, so we need a function TaskForIdentifier(taskIdentifier) that, given a task identifier, returns the task struct
  • Review and merge #197
  • Adapt the code repair logic to the changes made in #197
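
The implementation items above could be sketched roughly as follows. This is only an illustration under several assumptions, not the actual code of the repository: the `Repository` and `Assessment` types, the `write-tests` identifier, the `golang` type, the `IsSourceFile` helper and the `files-considered` key are hypothetical, and `Run` works on a flat file list instead of ranging over the sub-directories of the mistakes testdata.

```go
package main

import (
	"fmt"
	"strings"
)

// TaskIdentifier names an evaluation task.
type TaskIdentifier string

const (
	// IdentifierWriteTests is an assumed identifier for the existing task.
	IdentifierWriteTests TaskIdentifier = "write-tests"
	// IdentifierCodeRepair is the new identifier proposed in this issue.
	IdentifierCodeRepair TaskIdentifier = "code-repair"
)

// Language holds only the two methods proposed for the language interface.
type Language interface {
	DefaultFileExtension() string
	DefaultTestFileSuffix() string
}

// golang is a hypothetical Language implementation for Go.
type golang struct{}

func (golang) DefaultFileExtension() string  { return ".go" }
func (golang) DefaultTestFileSuffix() string { return "_test.go" }

// Repository is a simplified placeholder: a language plus a flat list of file paths.
type Repository struct {
	Language Language
	Files    []string
}

// Assessment is a placeholder for the real assessment type.
type Assessment map[string]int

// Task is the proposed task interface.
type Task interface {
	// Identifier returns the task identifier.
	Identifier() TaskIdentifier
	// Run runs the task for the given repository and returns the assessment.
	Run(repository Repository) (Assessment, error)
}

// IsSourceFile reports whether the path is a source file of the language and not a test file.
func IsSourceFile(language Language, filePath string) bool {
	return strings.HasSuffix(filePath, language.DefaultFileExtension()) &&
		!strings.HasSuffix(filePath, language.DefaultTestFileSuffix())
}

// TaskWriteTests stands in for the logic that currently lives in evaluate/repository.go:Evaluate.
type TaskWriteTests struct{}

func (*TaskWriteTests) Identifier() TaskIdentifier { return IdentifierWriteTests }

// Run is a stub in this sketch.
func (*TaskWriteTests) Run(repository Repository) (Assessment, error) {
	return Assessment{}, nil
}

// TaskCodeRepair only considers source files of the repository.
type TaskCodeRepair struct{}

func (*TaskCodeRepair) Identifier() TaskIdentifier { return IdentifierCodeRepair }

// Run only counts the files that would be sent to the model for repair.
func (*TaskCodeRepair) Run(repository Repository) (Assessment, error) {
	assessment := Assessment{}
	for _, filePath := range repository.Files {
		if IsSourceFile(repository.Language, filePath) {
			assessment["files-considered"]++
		}
	}
	return assessment, nil
}

// TaskForIdentifier returns the task for the given identifier.
func TaskForIdentifier(taskIdentifier TaskIdentifier) (Task, error) {
	switch taskIdentifier {
	case IdentifierWriteTests:
		return &TaskWriteTests{}, nil
	case IdentifierCodeRepair:
		return &TaskCodeRepair{}, nil
	default:
		return nil, fmt.Errorf("unknown task identifier %q", taskIdentifier)
	}
}

func main() {
	task, err := TaskForIdentifier(IdentifierCodeRepair)
	if err != nil {
		panic(err)
	}
	repository := Repository{
		Language: golang{},
		Files:    []string{"typeMissing/plain.go", "typeMissing/plain_test.go", "typeMissing/go.mod"},
	}
	assessment, err := task.Run(repository)
	if err != nil {
		panic(err)
	}
	fmt.Println(task.Identifier(), assessment)
}
```

With a lookup like this, the loop over temporaryRepository.Tasks in evaluate/evaluate.go:Evaluate would only need to resolve each identifier and call Run on the result. Filtering by file extension and test file suffix also keeps the tasks free of language-specific checks.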