wakamex / human3090

HumanEval scores evaluated locally on an RTX 3090


Unless otherwise specified:

  • maximum number of layers offloaded to GPU (`-ngl`)
  • parameter changes carry over into the following tests (temperature, penalties, etc.)
  • * denotes a non-local model, included for comparison
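Since sampling parameters like temperature and presence_penalty carry over between runs, it helps to build each request body explicitly. A minimal sketch, assuming a llama.cpp-style OpenAI-compatible server; the helper name and defaults here are illustrative, not this repo's code:

```python
# Sketch: build the JSON body for a completion request to a local
# llama.cpp server (OpenAI-compatible completions endpoint).
# Helper name and default values are assumptions for illustration.
import json


def build_completion_request(prompt, temperature=0.2, presence_penalty=0.0,
                             max_tokens=512):
    """Return a request body dict; explicit parameters avoid silent carry-over."""
    return {
        "prompt": prompt,
        "temperature": temperature,
        "presence_penalty": presence_penalty,
        "max_tokens": max_tokens,
    }


body = build_completion_request("def add(a, b):\n", temperature=0.0)
print(json.dumps(body))
```

Overriding temperature per run, as above, keeps each benchmark entry reproducible regardless of what the previous test used.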
| Model | Configuration | Human Eval |
| --- | --- | --- |
| GPT-4* | Instruction-style, temperature=0.2, presence_penalty=0 | 63.4% |
| GPT-4* | Completion-style | 84.1% |
| Mixtral8x7b | mixtral-8x7b-v0.1.Q5_K_M.gguf | 45.7% |
| Mistral-medium* | | 62.2% |
| Llama2* | HF API, CodeLlama-34b-Instruct-hf | 42.1% |
| Mistral-large* | | 73.2% |
| WizardLM2 | WizardLM-2-8x22B.IQ3_XS-00001-of-00005.gguf | 56.7% |
| Wizardcoder | wizardcoder-33b-v1.1.Q4_K_M.gguf, temperature=0.0 | 73.8% |
| Wizardcoder-Python | Q4_K_M quant, modified prompt | 57.3% |
| CodeFuse-Deepseek | CodeFuse-DeepSeek-33B-Q4_K_M.gguf | 68.7% |
| Deepseek | deepseek-coder-33b-instruct.Q4_K_M.gguf | 79.9% |
| OpenCodeInterpreter | ggml-opencodeinterpreter-ds-33b-q8_0.gguf, -ngl 40 | Failed |
| Deepseek | ggml-deepseek-coder-33b-instruct-q4_k_m.gguf | 78.7% |
| Deepseek | deepseek-coder-33b-instruct.Q5_K_M.gguf, -ngl 60, a bit slow | 79.3% |
| Llama3* | together.ai API, Llama-3-70b-chat-hf | 75.6% |
| DBRX* | together.ai API, dbrx-instruct | 48.8% |
| CodeQwen | codeqwen-1_5-7b-chat-q8_0.gguf | 83.5% |
| Llama3-8B | bartowski/Meta-Llama-3-8B-Instruct-GGUF | 52.4% |
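The percentages in the table are HumanEval pass rates. For reference, the unbiased pass@k estimator from the Codex paper ("Evaluating Large Language Models Trained on Code") can be sketched in Python; this is the standard formula, not code from this repo:

```python
# Unbiased pass@k estimator: given n samples per problem with c correct,
# pass@k = 1 - C(n - c, k) / C(n, k), the probability that a random
# size-k subset of the n samples contains at least one correct solution.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the tests."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With one sample per problem (n=k=1), pass@1 reduces to the raw pass rate:
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))  # → 1.0 0.0
```

Averaging `pass_at_k` over all 164 HumanEval problems gives the headline score; with greedy decoding (temperature=0.0) a single sample per problem is the common setup.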


Languages

Python 97.5%, Shell 2.5%