symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Home Page: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Weight "executed code" more prominently

zimmski opened this issue · comments

In the v0.5.0 eval run we have the problem that GPT-4 ranks better than Gemini 1.5 Flash: Gemini produces more code that is executable, but GPT has a higher coverage score, and that is why it comes out ahead. However, it makes more sense to order by executable code first and only then by coverage. We need to balance the following (see the sketch after the list):

  • Executable code should be weighted much higher
  • Coverage is still very important
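
A minimal sketch of what "order by executable code first, then coverage" could look like, with coverage only as a tie-breaker. The `Model` struct, field names, and numbers are made up for illustration and are not the actual evaluation types:

```go
package main

import (
	"fmt"
	"sort"
)

// Model aggregates hypothetical per-model results; fields and numbers are illustrative only.
type Model struct {
	Name            string
	ExecutableFiles int // generated files that executed successfully
	Coverage        int // coverage objects reached in total
}

func main() {
	models := []Model{
		{Name: "gpt-4", ExecutableFiles: 8, Coverage: 950},
		{Name: "gemini-1.5-flash", ExecutableFiles: 10, Coverage: 700},
	}

	// Order primarily by executable code; coverage only breaks ties.
	sort.Slice(models, func(i, j int) bool {
		if models[i].ExecutableFiles != models[j].ExecutableFiles {
			return models[i].ExecutableFiles > models[j].ExecutableFiles
		}
		return models[i].Coverage > models[j].Coverage
	})

	for rank, model := range models {
		fmt.Printf("%d. %s (executable=%d, coverage=%d)\n", rank+1, model.Name, model.ExecutableFiles, model.Coverage)
	}
}
```

Strict ordering like this ignores coverage entirely unless there is a tie, which is why the weighting discussion below still matters.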

@ahumenberger @bauersimon since this happens for multiple models in the overall score, we should fix this problem before we do the v0.6.0 run. Please discuss some solutions. Maybe we should even weight more metrics differently, e.g. response-with-code is more important than response-no-error.

The core problem we currently have is that we do not have normalized scores. This makes it inherently difficult to define fair and understandable weights for the different scoring categories. E.g. assume we define the weight for executable code to be 100 and the weight per coverage object to be 10, there are two models A and B, and we assess two examples, one with 10 coverage objects and one with 1000 coverage objects.
Model A produces perfect coverage for example 1 and only executable code for example 2, so its score is 100 (executable, example 1) + 10 * 10 (coverage, example 1) + 100 (executable, example 2) = 300.
Model B does not provide executable code for example 1, but gets full coverage for example 2, so its score is 100 (executable, example 2) + 1000 * 10 (coverage, example 2) = 10100.

If example 2 had just 2 coverage objects instead, then model A would still get 300 points, but model B would only get 100 + 2 * 10 = 120 points.

This shows, IMO, very well that just playing around with weights does not help us. We need normalized scores, e.g. between 0 and 100, and then it is much easier and more understandable to define and adjust weights.
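
A minimal sketch of the difference, reusing the numbers from the example above. The weights (100 per executable solution, 10 per coverage object, and 2:1 executable-to-coverage for the normalized variant) and all type names are hypothetical, not the actual scoring code:

```go
package main

import "fmt"

// Example describes one hypothetical assessment task.
type Example struct {
	CoverageObjects int // total coverage objects available in this example
}

// Result is one model's hypothetical result on one example.
type Result struct {
	Executable      bool
	CoverageReached int
}

// rawScore uses absolute counts: 100 points for executable code plus 10 points
// per coverage object reached, so examples with many coverage objects dominate.
func rawScore(results []Result) int {
	score := 0
	for _, r := range results {
		if r.Executable {
			score += 100
		}
		score += 10 * r.CoverageReached
	}
	return score
}

// normalizedScore maps every category to 0..100 per example before weighting,
// so an example with 1000 coverage objects counts as much as one with 10.
func normalizedScore(examples []Example, results []Result, weightExecutable, weightCoverage float64) float64 {
	score := 0.0
	for i, r := range results {
		executable := 0.0
		if r.Executable {
			executable = 100.0
		}
		coverage := 100.0 * float64(r.CoverageReached) / float64(examples[i].CoverageObjects)
		score += weightExecutable*executable + weightCoverage*coverage
	}
	return score / float64(len(examples)) // average over all examples
}

func main() {
	examples := []Example{{CoverageObjects: 10}, {CoverageObjects: 1000}}
	modelA := []Result{{Executable: true, CoverageReached: 10}, {Executable: true, CoverageReached: 0}}
	modelB := []Result{{Executable: false, CoverageReached: 0}, {Executable: true, CoverageReached: 1000}}

	fmt.Println("raw:", rawScore(modelA), rawScore(modelB))                                               // 300 vs 10100
	fmt.Println("normalized:", normalizedScore(examples, modelA, 2, 1), normalizedScore(examples, modelB, 2, 1)) // 250 vs 150
}
```

With raw scores model B wins 10100 to 300 purely because example 2 happens to have 1000 coverage objects; with per-example normalization model A wins, because executable code carries the higher weight independent of how many coverage objects an example has.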

Could avoid the weighting problem completely by not having one score but reporting everything separately. It feels almost impossible to break it down to just one single number and not run into problems.

Having an overall score is still valuable I think. Imagine you are trying to find the best LLM for your needs, and you have 5-10 factors on which you are trying to evaluate the LLMs, like "does code compile", "how much coverage", "time". And now there are 100 different LLMs. In the end you need a ranking of the LLMs, where each factor has some weight.
However, the weight always depends on the application, e.g. accuracy might be more relevant than speed or vice versa.

So in the end it would be great if we could provide some kind of dashboard where users can rank the factors themselves and provide weights for them. However, that still requires normalized scores.
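
A minimal sketch of such a user-configurable ranking on top of normalized scores; the factor names, weights, and per-model scores are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// Scores holds normalized per-factor scores (0..100) for one model;
// the factor names are hypothetical.
type Scores map[string]float64

// rank orders models by the weighted sum of their normalized scores,
// so users can plug in weights that match their own use case.
func rank(models map[string]Scores, weights map[string]float64) []string {
	names := make([]string, 0, len(models))
	for name := range models {
		names = append(names, name)
	}
	total := func(name string) float64 {
		sum := 0.0
		for factor, weight := range weights {
			sum += weight * models[name][factor]
		}
		return sum
	}
	sort.Slice(names, func(i, j int) bool { return total(names[i]) > total(names[j]) })
	return names
}

func main() {
	models := map[string]Scores{
		"model-a": {"compiles": 95, "coverage": 70, "speed": 30},
		"model-b": {"compiles": 70, "coverage": 60, "speed": 95},
	}

	// Accuracy-focused user: compiling code matters most, speed barely.
	fmt.Println(rank(models, map[string]float64{"compiles": 3, "coverage": 2, "speed": 0.5})) // model-a first
	// Latency-focused user: speed matters more than correctness.
	fmt.Println(rank(models, map[string]float64{"compiles": 1, "coverage": 1, "speed": 2})) // model-b first
}
```

The same two models swap ranks depending on whether the user weights compiling code or speed more heavily, which is exactly why the weights cannot be fixed globally but the underlying scores must be normalized.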