Benchmark OpenAI Assistants vs. open-source assistants
louis030195 opened this issue
Louis Beaumont commented
End goal would be to have something like this:
**OpenAI Assistants API Benchmark**

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| GPT-4 | 5 | 5 | 5 | 5 | 5 | 5 |
| GPT-3.5 | 4 | 4 | 4 | 4 | 4 | 4 |
**Open Source Assistants API Benchmark**

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| Mistral 7B | 5 | 5 | 5 | 5 | 5 | 5 |
| LLaMA 2 | 3 | 3 | 3 | 3 | 3 | 4 |
| LLaVA | 4 | 4 | 4 | 4 | 4 | 4 |
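As a minimal sketch of how such a table could be generated from raw per-capability scores, something like the following might work (the `scores` structure, capability names, and example values are assumptions for illustration, not existing code in this repo):

```python
# Minimal sketch: render per-capability benchmark scores as a markdown table.
# All model names and score values below are placeholders.

CAPABILITIES = [
    "Code Interpreter", "Retrieval", "Function Calling",
    "JSON Mode", "Tool Switching", "Speed",
]

def render_markdown_table(scores: dict[str, dict[str, float]]) -> str:
    header = "| Model Name | " + " | ".join(CAPABILITIES) + " |"
    separator = "|" + "---|" * (len(CAPABILITIES) + 1)
    rows = [
        "| " + model + " | "
        + " | ".join(str(caps.get(c, "-")) for c in CAPABILITIES) + " |"
        for model, caps in scores.items()
    ]
    return "\n".join([header, separator, *rows])

if __name__ == "__main__":
    example = {"Mistral 7B": {c: 5 for c in CAPABILITIES}}
    print(render_markdown_table(example))
```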
Louis Beaumont commented
Next steps:
- make sure the process doesn't crash if a request fails, and that it still writes the JSON results (see the sketch after this list)
- more tests: different models, domains, use cases, etc.
- scale it up
- make the results human readable
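A minimal sketch of that failure handling, assuming each benchmark case runs through a hypothetical `run_case` callable (the function names, retry policy, and output path are assumptions, not existing code):

```python
import json
import time
import traceback

def run_case_safely(run_case, case, retries: int = 2, delay_s: float = 2.0) -> dict:
    """Run one benchmark case; never raise, always return a JSON-serializable result."""
    for attempt in range(retries + 1):
        try:
            return {"case": case, "output": run_case(case), "error": None}
        except Exception:
            if attempt == retries:
                return {"case": case, "output": None, "error": traceback.format_exc()}
            time.sleep(delay_s)

def run_all(run_case, cases, out_path: str = "results.json") -> None:
    results = [run_case_safely(run_case, case) for case in cases]
    # Write the JSON results even if some (or all) cases failed.
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```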
Louis Beaumont commented
Also fix the result writer, which currently writes duplicate files.
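One possible approach is to derive a deterministic filename from the result's content, so re-running the same benchmark overwrites the file instead of creating a new one. This is a sketch under that assumption, not the current writer:

```python
import hashlib
import json
from pathlib import Path

def write_result(result: dict, out_dir: str = "results") -> Path:
    """Write one result to a content-derived filename, so reruns don't produce duplicates."""
    payload = json.dumps(result, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    path = Path(out_dir) / f"{result.get('model', 'unknown')}-{digest}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path
```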
Louis Beaumont commented
Another idea is to just write a bunch of rows with "input", "output", and "expected", and use best-practice LLM scoring:
https://github.com/openai/evals
Since Assistants are basically software 3.0 (foundation models) plus software 1.0 hacks and plumbing, we might also add a column for the extra context the LLM received, or something like that.
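As a sketch of what those rows could look like (the field names, including `extra_context`, are assumptions; openai/evals defines its own JSONL schemas and model-graded evals, this is only an illustration):

```python
import json

# Hypothetical row format: one JSONL line per benchmark case.
rows = [
    {
        "input": "What is 2 + 2?",
        "output": "4",             # what the assistant actually returned
        "expected": "4",           # reference answer
        "extra_context": ["retrieved doc chunk"],  # software-1.0 plumbing fed to the LLM
    },
]

with open("cases.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Naive exact-match score; a model-graded scorer (as in openai/evals) could replace this.
def score(row: dict) -> float:
    return 1.0 if row["output"].strip() == row["expected"].strip() else 0.0
```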
If anyone has ideas on how to apply LLM benchmarking best practices to this project 🙏