terryyz / llm-benchmark

A list of LLM benchmark frameworks.

awesome-list benchmark evaluation llm

llm-benchmark

A list of comprehensive LLM evaluation frameworks. Contributions welcome!

| Benchmark | Release Date | Repository | Paper/Blog | Number of Datasets | Aspect | Licence |
| --- | --- | --- | --- | --- | --- | --- |
| HELM | --- | https://github.com/stanford-crfm/helm | Holistic Evaluation of Language Models | 42 | --- | --- |
| BIG-bench | --- | https://github.com/google/BIG-bench | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | 214 | --- | --- |
| BigBIO | --- | https://github.com/bigscience-workshop/biomedical | BigBio: A Framework for Data-Centric Biomedical Natural Language Processing | 126 | --- | --- |
| BigScience Evaluation | --- | https://github.com/bigscience-workshop/evaluation | --- | 28 | --- | --- |
| Language Model Evaluation Harness | --- | https://github.com/EleutherAI/lm-evaluation-harness | Evaluating Large Language Models (LLMs) with Eleuther AI; Evaluating LLMs | 56 | --- | --- |
| Scholar Evals | --- | https://github.com/scholar-org/scholar-evals | --- | --- | --- | --- |
| Code Generation LM Evaluation Harness | --- | https://github.com/bigcode-project/bigcode-evaluation-harness | --- | 13 | --- | --- |
| Chatbot Arena | --- | https://github.com/lm-sys/FastChat | --- | --- | --- | --- |
| GLUE | --- | https://github.com/nyu-mll/jiant | --- | 11 | --- | --- |
| SuperGLUE | --- | https://github.com/nyu-mll/jiant | --- | 10 | --- | --- |
| CLUE | --- | https://github.com/CLUEbenchmark/CLUE | --- | 9 | --- | --- |
| CodeXGLUE | --- | https://github.com/microsoft/CodeXGLUE | --- | 10 | --- | --- |


Apache License 2.0