OpenCompass (open-compass)

Organization data from GitHub: https://github.com/open-compass

Location: China

Home Page: opencompass.org.cn

GitHub: @open-compass

Twitter: @OpenCompassX

OpenCompass's repositories

opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Language: Python | License: Apache-2.0 | Stargazers: 6266 | Issues: 753
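
As a rough sketch of how an evaluation run might be launched (the model and dataset aliases hf_internlm2_7b and mmlu_gen are illustrative assumptions; exact flags and config names vary between OpenCompass releases):

    # Hedged sketch: launch an OpenCompass evaluation by shelling out to its run.py entry point.
    # The model/dataset aliases below are assumptions and may not exist in every release.
    import subprocess

    subprocess.run(
        ["python", "run.py",
         "--models", "hf_internlm2_7b",
         "--datasets", "mmlu_gen",
         "--debug"],
        check=True,
    )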

VLMEvalKit

An open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

Language: Python | License: Apache-2.0 | Stargazers: 3326 | Issues: 507
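
A minimal sketch of VLMEvalKit's Python interface for querying a single model, assuming the supported_VLM registry and generate API from the project README; the model alias qwen_chat and the image path are placeholders:

    # Hedged sketch: instantiate a supported LMM by alias and ask a question about one image.
    # The alias and file path are illustrative; available models depend on the installed version.
    from vlmeval.config import supported_VLM

    model = supported_VLM["qwen_chat"]()
    answer = model.generate(["demo.jpg", "What is shown in this image?"])
    print(answer)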

T-Eval

[ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step

Language: Python | License: Apache-2.0 | Stargazers: 297 | Issues: 57

MMBench

Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"

BotChat

Evaluating LLMs' multi-round chatting capability by assessing conversations generated by two LLM instances.

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 155 | Issues: 2

GTA

[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

Language: Python | License: Apache-2.0 | Stargazers: 128 | Issues: 2

CompassJudger

All-in-one judge models introduced by OpenCompass.

DevEval

A Comprehensive Benchmark for Software Development.

Language: Python | License: Apache-2.0 | Stargazers: 113 | Issues: 2

MathBench

[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset

MMBench-GUI

Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It evaluates GUI agents in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.

Language: Python | Stargazers: 84 | Issues: 0

ANAH

[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO

Language: Python | License: Apache-2.0 | Stargazers: 55 | Issues: 7

Ada-LEval

The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"

CompassVerifier

[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Language: Jupyter Notebook | Stargazers: 51 | Issues: 0

CriticEval

[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs

Language: Python | License: Apache-2.0 | Stargazers: 47 | Issues: 3

GPassK

[ACL 2025] Are Your LLMs Capable of Stable Reasoning?

ProSA

[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Language: Python | License: Apache-2.0 | Stargazers: 29 | Issues: 0

Creation-MMBench

Assessing Context-Aware Creative Intelligence in MLLMs

Language: JavaScript | Stargazers: 23 | Issues: 0

CIBench

Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"

Language: Python | License: Apache-2.0 | Stargazers: 13 | Issues: 1

RaML

[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Language: Jupyter Notebook | Stargazers: 6 | Issues: 0

human-eval

Code for the paper "Evaluating Large Language Models Trained on Code"

Language: Python | License: MIT | Stargazers: 3 | Issues: 0
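
A short sketch of the workflow described in the human-eval README: read the problems, write model completions to a JSONL file, then score them with the package's evaluate_functional_correctness command; generate_one_completion is a placeholder for your own model call:

    # Hedged sketch following the human-eval README workflow.
    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt):
        # Placeholder: replace with a real model call that returns code completing `prompt`.
        return "    return None\n"

    problems = read_problems()
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)
    # Score afterwards with: evaluate_functional_correctness samples.jsonl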

hinode

A clean documentation and blog theme for your Hugo site based on Bootstrap 5

Language: HTML | License: MIT | Stargazers: 0 | Issues: 0