symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Home Page: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Improve maintainability of assessments

ahumenberger opened this issue

As of now, assessments are stored in a nested map type, AssessmentPerModelPerLanguagePerRepository map[model.Model]map[language.Language]map[string]metrics.Assessments, and this type is used as-is for both storing and retrieving assessments. This is hard to maintain and extend. E.g., for #165 we need to store assessments per task. Extending the nested map is tedious because basically every test asserts such a map, and every single one of them has to be changed.

I'm proposing to make the usage independent of how the assessments are stored. My proposal includes a type

type AssessmentTuple struct {
	Model      model.Model
	Language   language.Language
	Repository string
}

and a storage type

type AssessmentStore struct {
  ...
}

func (as *AssessmentStore) Store(tuple *AssessmentTuple) {
  ...
}

This makes it possible to specify a list of expected assessment tuples in a test without having to deal with AssessmentPerModelPerLanguagePerRepository directly, which greatly improves maintainability.
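To illustrate what "a list of expected assessment tuples" could look like in a test, here is a minimal sketch. It is not the project's implementation: Model, Language and Assessments are simplified stand-ins for model.Model, language.Language and metrics.Assessments, and the Entry type and Flatten helper are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Simplified stand-ins (assumptions) for model.Model, language.Language
// and metrics.Assessments from the real code base.
type (
	Model       string
	Language    string
	Assessments map[string]uint64
)

// AssessmentTuple identifies what an assessment is associated with.
type AssessmentTuple struct {
	Model      Model
	Language   Language
	Repository string
}

// Entry is a hypothetical pairing of a tuple with its assessments.
type Entry struct {
	Tuple       AssessmentTuple
	Assessments Assessments
}

// Flatten converts the nested map into a list of entries sorted by
// model, then language, then repository, so tests can compare it
// against a flat list of expectations.
func Flatten(nested map[Model]map[Language]map[string]Assessments) []Entry {
	var entries []Entry
	for m, perLanguage := range nested {
		for l, perRepository := range perLanguage {
			for r, a := range perRepository {
				tuple := AssessmentTuple{Model: m, Language: l, Repository: r}
				entries = append(entries, Entry{Tuple: tuple, Assessments: a})
			}
		}
	}
	sort.Slice(entries, func(i, j int) bool {
		a, b := entries[i].Tuple, entries[j].Tuple
		if a.Model != b.Model {
			return a.Model < b.Model
		}
		if a.Language != b.Language {
			return a.Language < b.Language
		}
		return a.Repository < b.Repository
	})
	return entries
}

func main() {
	nested := map[Model]map[Language]map[string]Assessments{
		"model-a": {"golang": {"plain": {"coverage": 10}}},
	}
	// The test only lists flat expectations and never spells out the nesting.
	expected := []Entry{
		{
			Tuple:       AssessmentTuple{Model: "model-a", Language: "golang", Repository: "plain"},
			Assessments: Assessments{"coverage": 10},
		},
	}
	fmt.Println(reflect.DeepEqual(Flatten(nested), expected))
}
```

With this shape, adding a Task field to the tuple would only touch the tuple type and the flattening helper, not the structure of every test expectation.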

Related tasks:

So the tuple holds the information with which model/language/repository (and in the future: task) an assessment is associated? But where is the actual assessment stored?


Well, the whole point is that this does not matter: AssessmentStore should take care of that. For the time being, AssessmentStore will just use a nested map internally, but callers should not care at all how the assessments are stored.

I'm currently working on a version of this where AssessmentTuple is only used in testing, to make tests more maintainable. AssessmentStore looks like this:

type AssessmentStore struct {
	store map[model.Model]map[language.Language]map[string]metrics.Assessments
}
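
A minimal sketch of how callers might use such a store without ever seeing the nested map. The method names Add and Get, their signatures, and the choice to merge assessments by summing values are assumptions made for illustration, not the project's actual API. Model, Language and Assessments are simplified stand-ins for model.Model, language.Language and metrics.Assessments.

```go
package main

import "fmt"

// Simplified stand-ins (assumptions) for model.Model, language.Language
// and metrics.Assessments from the real code base.
type (
	Model       string
	Language    string
	Assessments map[string]uint64
)

// AssessmentStore hides the nested map from its callers.
type AssessmentStore struct {
	store map[Model]map[Language]map[string]Assessments
}

// NewAssessmentStore returns an empty store (hypothetical constructor).
func NewAssessmentStore() *AssessmentStore {
	return &AssessmentStore{
		store: map[Model]map[Language]map[string]Assessments{},
	}
}

// Add records assessments for the given model, language and repository,
// creating intermediate maps as needed. Summing into existing values is
// an assumption of this sketch.
func (as *AssessmentStore) Add(m Model, l Language, repository string, a Assessments) {
	perLanguage, ok := as.store[m]
	if !ok {
		perLanguage = map[Language]map[string]Assessments{}
		as.store[m] = perLanguage
	}
	perRepository, ok := perLanguage[l]
	if !ok {
		perRepository = map[string]Assessments{}
		perLanguage[l] = perRepository
	}
	existing, ok := perRepository[repository]
	if !ok {
		existing = Assessments{}
		perRepository[repository] = existing
	}
	for key, value := range a {
		existing[key] += value
	}
}

// Get looks up the assessments for the given model, language and repository.
// Indexing a missing (nil) inner map is safe in Go and yields the zero value.
func (as *AssessmentStore) Get(m Model, l Language, repository string) (Assessments, bool) {
	a, ok := as.store[m][l][repository]
	return a, ok
}

func main() {
	s := NewAssessmentStore()
	s.Add("model-a", "golang", "plain", Assessments{"coverage": 10})
	s.Add("model-a", "golang", "plain", Assessments{"coverage": 5})

	a, ok := s.Get("model-a", "golang", "plain")
	fmt.Println(a["coverage"], ok) // prints "15 true"
}
```

Because callers only go through Add and Get, a later change such as an additional per-task level would stay inside the store.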