Benchmark OpenAI Assistants vs. open-source assistants
louis030195 opened this issue
Louis Beaumont commented
End goal would be to have something like this:
**OpenAI Assistants API Benchmark**

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| GPT-4 | 5 | 5 | 5 | 5 | 5 | 5 |
| GPT-3.5 | 4 | 4 | 4 | 4 | 4 | 4 |
**Open Source Assistants API Benchmark**

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| Mistral 7B | 5 | 5 | 5 | 5 | 5 | 5 |
| LLaMA 2 | 3 | 3 | 3 | 3 | 3 | 4 |
| LLaVA | 4 | 4 | 4 | 4 | 4 | 4 |
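As a minimal sketch of how such a table could be generated from raw per-capability scores, something like the following might work (the `scores` structure, capability names, and example values are assumptions for illustration, not existing code in this repo):

```python
# Minimal sketch: render per-capability benchmark scores as a markdown table.
# All model names and score values below are placeholders.

CAPABILITIES = [
    "Code Interpreter", "Retrieval", "Function Calling",
    "JSON Mode", "Tool Switching", "Speed",
]

def render_markdown_table(scores: dict[str, dict[str, float]]) -> str:
    header = "| Model Name | " + " | ".join(CAPABILITIES) + " |"
    separator = "|" + "---|" * (len(CAPABILITIES) + 1)
    rows = [
        "| " + model + " | "
        + " | ".join(str(caps.get(c, "-")) for c in CAPABILITIES) + " |"
        for model, caps in scores.items()
    ]
    return "\n".join([header, separator, *rows])

if __name__ == "__main__":
    example = {"Mistral 7B": {c: 5 for c in CAPABILITIES}}
    print(render_markdown_table(example))
```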
Louis Beaumont commented
Next steps:
- make sure the process doesn't crash if a request fails, and that it still writes the JSON results (see the sketch after this list)
- more tests: different models, domains, use cases, etc.
- scale it up
- make the results human readable
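A minimal sketch of that failure handling, assuming each benchmark case runs through a hypothetical `run_case` callable (the function names, retry policy, and output path are assumptions, not existing code):

```python
import json
import time
import traceback

def run_case_safely(run_case, case, retries: int = 2, delay_s: float = 2.0) -> dict:
    """Run one benchmark case; never raise, always return a JSON-serializable result."""
    for attempt in range(retries + 1):
        try:
            return {"case": case, "output": run_case(case), "error": None}
        except Exception:
            if attempt == retries:
                return {"case": case, "output": None, "error": traceback.format_exc()}
            time.sleep(delay_s)

def run_all(run_case, cases, out_path: str = "results.json") -> None:
    results = [run_case_safely(run_case, case) for case in cases]
    # Write the JSON results even if some (or all) cases failed.
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```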
Louis Beaumont commented
Also fix the result writer, which currently writes duplicate files.
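One possible approach is to derive a deterministic filename from the result's content, so re-running the same benchmark overwrites the file instead of creating a new one. This is a sketch under that assumption, not the current writer:

```python
import hashlib
import json
from pathlib import Path

def write_result(result: dict, out_dir: str = "results") -> Path:
    """Write one result to a content-derived filename, so reruns don't produce duplicates."""
    payload = json.dumps(result, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    path = Path(out_dir) / f"{result.get('model', 'unknown')}-{digest}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path
```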
Louis Beaumont commented
Another idea is to just write a bunch of rows with "input", "output", and "expected", and use best-practice LLM scoring:
https://github.com/openai/evals
Since Assistants are basically software 3.0 (foundation models) plus software 1.0 hacks and plumbing, we might also add a column for the extra context the LLM received, or something like that.
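As a sketch of what those rows could look like (the field names, including `extra_context`, are assumptions; openai/evals defines its own JSONL schemas and model-graded evals, this is only an illustration):

```python
import json

# Hypothetical row format: one JSONL line per benchmark case.
rows = [
    {
        "input": "What is 2 + 2?",
        "output": "4",             # what the assistant actually returned
        "expected": "4",           # reference answer
        "extra_context": ["retrieved doc chunk"],  # software-1.0 plumbing fed to the LLM
    },
]

with open("cases.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Naive exact-match score; a model-graded scorer (as in openai/evals) could replace this.
def score(row: dict) -> float:
    return 1.0 if row["output"].strip() == row["expected"].strip() else 0.0
```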
If anyone has ideas on how to apply LLM benchmarking best practices to this project 🙏