llm-edge / hal-9100

Edge full-stack LLM platform. Written in Rust

Benchmark OpenAI Assistants vs. open-source assistants

louis030195 opened this issue

The end goal would be to have something like this:

OpenAI Assistants API Benchmark

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| GPT-4 | 5 | 5 | 5 | 5 | 5 | 5 |
| GPT-3.5 | 4 | 4 | 4 | 4 | 4 | 4 |

Open Source Assistants API Benchmark

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| Mistral 7B | 5 | 5 | 5 | 5 | 5 | 5 |
| LLaMA 2 | 3 | 3 | 3 | 3 | 3 | 4 |
| LLaVA | 4 | 4 | 4 | 4 | 4 | 4 |
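
For illustration, a minimal Rust sketch of how such a table could be rendered from per-model scores. The `Scores` struct, its field names, and the 1–5 scale are assumptions for this example, not existing hal-9100 types:

```rust
// Hypothetical per-model score record for one benchmark run.
struct Scores {
    model: &'static str,
    code_interpreter: u8,
    retrieval: u8,
    function_calling: u8,
    json_mode: u8,
    tool_switching: u8,
    speed: u8,
}

// Render the scores as a GitHub-flavored markdown table.
fn render_table(rows: &[Scores]) -> String {
    let mut out = String::from(
        "| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |\n\
         |---|---|---|---|---|---|---|\n",
    );
    for r in rows {
        out.push_str(&format!(
            "| {} | {} | {} | {} | {} | {} | {} |\n",
            r.model, r.code_interpreter, r.retrieval, r.function_calling,
            r.json_mode, r.tool_switching, r.speed
        ));
    }
    out
}

fn main() {
    let rows = [Scores {
        model: "Mistral 7B",
        code_interpreter: 5, retrieval: 5, function_calling: 5,
        json_mode: 5, tool_switching: 5, speed: 5,
    }];
    println!("{}", render_table(&rows));
}
```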

Next:

  • make sure the process doesn't crash if a request fails or something, and still writes the result JSON (see the sketch after this list)
  • more tests, different models, domains, use cases, etc.
  • scale it
  • make the output human-readable
Also fix the result writer, which currently writes duplicate files.
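
One possible fix, just a sketch (the naming scheme is an assumption): derive a deterministic file path from the model and run id, so repeated writes overwrite a single file instead of creating duplicates:

```rust
use std::fs;
use std::path::PathBuf;

// One file per (model, run) pair: writing the same pair twice overwrites
// the existing file instead of producing a duplicate.
fn result_path(model: &str, run_id: &str) -> PathBuf {
    PathBuf::from("results").join(format!("{model}_{run_id}.json"))
}

fn write_result(model: &str, run_id: &str, json: &str) -> std::io::Result<()> {
    fs::create_dir_all("results")?;
    fs::write(result_path(model, run_id), json)
}

fn main() -> std::io::Result<()> {
    write_result("mistral-7b", "run-001", "[]")
}
```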

Another idea is to just write a bunch of rows with "input", "output", and "expected", and use best-practice LLM scoring:

https://github.com/openai/evals

Since Assistants are basically software 3.0 (foundation models) plus software 1.0 hacks and plumbing, it might also be worth adding a column for the extra context the LLM received, or something like that.
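
Putting those two ideas together, a hedged sketch of what a row could look like, written as JSONL since that is the sample format openai/evals-style scoring tooling typically consumes. The field names and the `extra_context` column are assumptions, and serde/serde_json are assumed dependencies:

```rust
use serde::Serialize;
use std::io::Write;

// Hypothetical eval row: what the model was asked, what it answered,
// what we expected, plus the extra context injected by the assistants runtime.
#[derive(Serialize)]
struct EvalRow {
    input: String,
    output: String,
    expected: String,
    // Retrieved chunks, tool outputs, etc. passed to the LLM, so a scoring
    // model can judge the answer given everything the model actually saw.
    extra_context: String,
}

fn main() -> std::io::Result<()> {
    let rows = vec![EvalRow {
        input: "What is 2 + 2?".into(),
        output: "4".into(),
        expected: "4".into(),
        extra_context: "".into(),
    }];
    let mut file = std::fs::File::create("eval_rows.jsonl")?;
    for row in &rows {
        // One JSON object per line (JSONL).
        writeln!(file, "{}", serde_json::to_string(row).unwrap())?;
    }
    Ok(())
}
```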

If anyone has ideas on how to apply LLM benchmarking best practices to this project 🙏