OpenCompass (open-compass)

Organization data from GitHub: https://github.com/open-compass

Location: China

Home Page: opencompass.org.cn

GitHub: @open-compass

Twitter: @OpenCompassX

OpenCompass's repositories

opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Language: Python | License: Apache-2.0 | Stargazers: 6266 | Issues: 753
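
As a rough sketch of how an evaluation run might be launched (the model and dataset aliases hf_internlm2_7b and mmlu_gen are illustrative assumptions; exact flags and config names vary between OpenCompass releases):

    # Hedged sketch: launch an OpenCompass evaluation by shelling out to its run.py entry point.
    # The model/dataset aliases below are assumptions and may not exist in every release.
    import subprocess

    subprocess.run(
        ["python", "run.py",
         "--models", "hf_internlm2_7b",
         "--datasets", "mmlu_gen",
         "--debug"],
        check=True,
    )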

VLMEvalKit

An open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

Language: Python | License: Apache-2.0 | Stargazers: 3326 | Issues: 507
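
A minimal sketch of VLMEvalKit's Python interface for querying a single model, assuming the supported_VLM registry and generate API from the project README; the model alias qwen_chat and the image path are placeholders:

    # Hedged sketch: instantiate a supported LMM by alias and ask a question about one image.
    # The alias and file path are illustrative; available models depend on the installed version.
    from vlmeval.config import supported_VLM

    model = supported_VLM["qwen_chat"]()
    answer = model.generate(["demo.jpg", "What is shown in this image?"])
    print(answer)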

T-Eval

[ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step

Language: Python | License: Apache-2.0 | Stargazers: 297 | Issues: 57

MMBench

Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"

BotChat

Evaluating LLMs' multi-round chatting capability by assessing conversations generated by two LLM instances.

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 155 | Issues: 2

GTA

[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

Language: Python | License: Apache-2.0 | Stargazers: 128 | Issues: 2

CompassJudger

All-in-one judge models introduced by OpenCompass.

DevEval

A Comprehensive Benchmark for Software Development.

Language: Python | License: Apache-2.0 | Stargazers: 113 | Issues: 2

MathBench

[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset

MMBench-GUI

Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It evaluates GUI agents in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.

Language: Python | Stargazers: 84 | Issues: 0

ANAH

[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO

Language: Python | License: Apache-2.0 | Stargazers: 55 | Issues: 7

Ada-LEval

The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"

CompassVerifier

[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Language: Jupyter Notebook | Stargazers: 51 | Issues: 0

CriticEval

[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs

Language: Python | License: Apache-2.0 | Stargazers: 47 | Issues: 3

GPassK

[ACL 2025] Are Your LLMs Capable of Stable Reasoning?

ProSA

[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Language: Python | License: Apache-2.0 | Stargazers: 29 | Issues: 0

Creation-MMBench

Assessing Context-Aware Creative Intelligence in MLLMs

Language: JavaScript | Stargazers: 23 | Issues: 0

CIBench

Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"

Language: Python | License: Apache-2.0 | Stargazers: 13 | Issues: 1

RaML

[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Language: Jupyter Notebook | Stargazers: 6 | Issues: 0

human-eval

Code for the paper "Evaluating Large Language Models Trained on Code"

Language: Python | License: MIT | Stargazers: 3 | Issues: 0
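
A short sketch of the workflow described in the human-eval README: read the problems, write model completions to a JSONL file, then score them with the package's evaluate_functional_correctness command; generate_one_completion is a placeholder for your own model call:

    # Hedged sketch following the human-eval README workflow.
    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt):
        # Placeholder: replace with a real model call that returns code completing `prompt`.
        return "    return None\n"

    problems = read_problems()
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)
    # Score afterwards with: evaluate_functional_correctness samples.jsonl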

hinode

A clean documentation and blog theme for your Hugo site based on Bootstrap 5

Language: HTML | License: MIT | Stargazers: 0 | Issues: 0