symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Home Page: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Apply symflower fix to a "write-test" result of a model

bauersimon opened this issue

Basically we want to execute the "write-test" task, but then optionally call symflower repair on the generated tests.
The scoring should then treat the original "write-test" result and the (hopefully) repaired tests as two separate results.

  • introduce a new TaskIdentifier write-test-symflower-repair that represents the "write-test" task but with symflower repair applied
  • let task.Run return map[TaskIdentifier]Assessments so that it can return both the unfixed and the fixed results (a rough sketch follows after this list)
  • extend the "write-test" task such that
    • if the first result fails at execution, apply symflower repair and execute again
    • write the results of symflower repair into the original model's log, i.e. with symflower repair: .......
    • watch out for the scoring... the score must be combined intelligently
      • runtime probably has to be added up
      • for coverage and execution only the final "attempt" matters
      • ...
  • symflower fix needs to be merged
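
A minimal sketch of how this could fit together. All names here (TaskIdentifier, Assessments, writeTests, symflowerRepair) are hypothetical stand-ins, not the repository's actual types or signatures; the point is only to show both results being returned and the scores being combined as described above:

```go
package task

// TaskIdentifier names a task variant; this and everything below is a
// simplified, hypothetical stand-in for the repository's real types.
type TaskIdentifier string

const (
	IdentifierWriteTest                TaskIdentifier = "write-test"
	IdentifierWriteTestSymflowerRepair TaskIdentifier = "write-test-symflower-repair"
)

// Assessments holds the evaluation metrics of a single attempt.
type Assessments struct {
	RuntimeMilliseconds int64
	Coverage            float64
	ExecutionPassed     bool
}

// writeTests and symflowerRepair are placeholders for the actual logic of
// generating tests with a model and repairing them with Symflower.
func writeTests(repositoryPath string) (Assessments, error)      { return Assessments{}, nil }
func symflowerRepair(repositoryPath string) (Assessments, error) { return Assessments{}, nil }

// Run executes the "write-test" task and, if the generated tests fail to
// execute, applies symflower repair and executes them again. Both outcomes
// are returned under their own identifiers so they can be scored separately.
func Run(repositoryPath string) (map[TaskIdentifier]Assessments, error) {
	results := map[TaskIdentifier]Assessments{}

	original, err := writeTests(repositoryPath)
	if err != nil {
		return nil, err
	}
	results[IdentifierWriteTest] = original

	repaired := original
	if !original.ExecutionPassed {
		fixed, err := symflowerRepair(repositoryPath)
		if err != nil {
			return nil, err
		}

		// Combine the scores: runtimes add up, while coverage and
		// execution success come from the final attempt only.
		repaired = Assessments{
			RuntimeMilliseconds: original.RuntimeMilliseconds + fixed.RuntimeMilliseconds,
			Coverage:            fixed.Coverage,
			ExecutionPassed:     fixed.ExecutionPassed,
		}
	}
	results[IdentifierWriteTestSymflowerRepair] = repaired

	return results, nil
}
```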

@Munsio plz review

  • When we want to "repair" something, don't we need the data from the previous model? How is it shared?
  • How does the next model in the chain know that it needs to repair something rather than generate it from scratch?

The task evaluation logic can carry over the "broken code" and just ask the next model to do a "repair". The models themselves don't need to worry about sharing context or deciding what to do. The evaluation logic effectively tells the model: "this attempt failed, please fix it, this is what we have so far".
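
As a rough illustration of that hand-over (again, all names are hypothetical and not taken from the repository): the evaluation loop keeps the failed attempt and its execution output, and builds the follow-up "repair" request itself, so the repairing model only ever sees one self-contained prompt.

```go
package evaluate

// repairRequest bundles everything the evaluation logic carries over from a
// failed "write-test" attempt.
type repairRequest struct {
	OriginalPrompt string // the initial "write-test" instruction
	BrokenCode     string // the tests generated by the previous attempt
	ExecutionError string // compiler or test-runner output of the failed run
}

// repairPrompt turns the carried-over data into a single, self-contained
// request, so the repairing model needs no shared context with the first one.
func repairPrompt(request repairRequest) string {
	return request.OriginalPrompt +
		"\n\nThe following tests were generated but fail to execute:\n" + request.BrokenCode +
		"\n\nExecution output:\n" + request.ExecutionError +
		"\n\nPlease repair the tests so that they compile and pass."
}
```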