symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Home Page: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Isolation of evaluations

Munsio opened this issue

Going forward, we need to isolate the evaluation runs. This will eventually allow us to run evaluations of multiple models in parallel on a single host or in a cluster.

1 Iteration:

  • Create a Docker image
    • Contains eval-dev-quality binary
    • Has all the necessary prerequisites installed from an archive with fixed versions
      • Java
      • Maven
      • Gradle
      • Go
      • eval-dev-quality install-all
  • Documentation
    • How to build locally (bash script)
    • How to run (bash script)
    • Script to run multiple local docker instances simultaneously
    • Script to run multiple instances inside Kubernetes

2 Iteration:

  • Build the image on each PR (+ main) and publish it to the GitHub registry
  • Add an additional option --runtime docker (default is "local" which runs as before)
    • If specified, each model is run inside a local Docker container
  • Add an additional option --parallel $uint (default is "1")
    • --parallel defines how many models run in parallel
    • The option is only allowed if the runtime != local
    • Add a check that --sequential is only allowed when runtime == local
    • Print a notice that --sequential is skipped if runtime != local but is passed on to the subsequent runs
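
The option rules above could be checked roughly as follows. This is only a sketch: the `Options` struct, its field names and the `Validate` method are hypothetical and not taken from the eval-dev-quality code base. It also assumes that `--parallel 0` is rejected and that passing `--sequential` with a non-local runtime produces a notice rather than an error, which is one reading of the last two bullets.

```go
package main

import (
	"errors"
	"fmt"
)

// Options mirrors the flags proposed in this issue. The struct and its field
// names are hypothetical and only serve this illustration.
type Options struct {
	Runtime    string // "local" (default) or "docker"
	Parallel   uint   // number of models run in parallel, default 1
	Sequential bool   // whether --sequential was passed
}

// Validate applies the rules listed for --runtime, --parallel and --sequential.
// It returns a notice to print (possibly empty) and an error for invalid input.
func (o Options) Validate() (notice string, err error) {
	switch o.Runtime {
	case "local", "docker":
	default:
		return "", fmt.Errorf("unknown runtime %q", o.Runtime)
	}
	// Assumption: zero parallel runs make no sense, so reject them.
	if o.Parallel == 0 {
		return "", errors.New("--parallel must be at least 1")
	}
	// --parallel is only allowed if the runtime is not "local".
	if o.Runtime == "local" && o.Parallel != 1 {
		return "", errors.New("--parallel is only allowed if the runtime is not \"local\"")
	}
	// --sequential only takes effect locally; otherwise it is passed on.
	if o.Runtime != "local" && o.Sequential {
		notice = "--sequential is skipped for this runtime but passed on to the subsequent runs"
	}
	return notice, nil
}

func main() {
	notice, err := Options{Runtime: "docker", Parallel: 4, Sequential: true}.Validate()
	fmt.Println(notice, err)
}
```

Returning the notice instead of printing it inside `Validate` keeps the check easy to test; the caller decides where the message goes.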

3 Iteration:

  • Add an additional runtime kubernetes
  • Runs all the models simultaneously on a Kubernetes cluster
  • Uses the locally installed kubectl command and its default context

4 Iteration:

  • Fetch data from the Kubernetes runtime automatically (optional parameter)
  • Merge the results from the docker and kubernetes runtimes back into a summary. This will be solved by #205

5 Iteration:

  • Support Ollama in container