Repositories under the llm-as-a-judge topic:
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluate your LLM's responses with Prometheus and GPT-4 💯
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
The repository for a survey of bias and fairness in information retrieval (IR) with LLMs.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
A set of tools for generating synthetic data from documents
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Harnessing Large Language Models for Curated Code Reviews
MCP as a Judge is a behavioral MCP (Model Context Protocol) server that strengthens AI coding assistants by requiring explicit LLM evaluations
MCP server for the Root Signals Evaluation Platform
A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques (a small retrieval-metric sketch follows this list)
LLM-as-judge evals as Semantic Kernel Plugins
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
LLM-as-a-judge for Extractive QA datasets
Explore techniques to use small models as jailbreaking judges
Controversial Questions for Argumentation and Retrieval
A multi-agent systems framework written in Rust. Domain agents (specialists) can use tools; workflow agents can load or define a workflow and monitor its execution. LLM-as-a-judge is used for evaluation, and a Discovery Service and Memory Service support agent interactions.
The code for the ACL 2025 paper "RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation for High-End Models"
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern (a minimal sketch of the judge pattern follows this list).
A lightweight library for LLM-as-a-judge evaluations on vLLM-hosted models.
Official implementation for the paper "Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation"
The official repository for our ACL 2025 paper: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
Replication package for PROBE-SWE: a dynamic benchmark to generate, validate, and analyze data-induced cognitive biases in general-purpose AI (GPAI) on typical software-engineering dilemmas.
An intelligent chatbot that provides information about courses, exams, services, and procedures of the Catholic University, built with RAG (Retrieval-Augmented Generation)
What if AI models were judging your performance review or resume? This system reveals the hidden biases and preferences of AI judges by running competitive tournaments between different writing styles and optimization strategies.
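Several of the entries above (for example the judge-pattern notebooks and the vLLM judging library) revolve around the same basic loop: send a question/answer pair plus a scoring rubric to a judge model and parse a structured verdict. The following is a minimal, illustrative sketch of that loop using the OpenAI Python client; the model name, rubric wording, and 1-5 score scale are assumptions, and it does not reflect how any specific repository listed here is implemented:

# Minimal LLM-as-a-judge sketch (illustrative only; model name and rubric are assumptions)
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are an evaluation judge. Rate the ANSWER to the QUESTION on a 1-5 "
    "scale for correctness and helpfulness. Reply with JSON only: "
    '{"score": <int 1-5>, "reason": "<one sentence>"}'
)

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Score one question/answer pair with a judge model and return the parsed verdict."""
    response = client.chat.completions.create(
        model=model,                              # assumed judge model; swap in your own
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
        temperature=0,                            # keep verdicts as reproducible as possible
        response_format={"type": "json_object"},  # request machine-parseable output
    )
    return json.loads(response.choices[0].message.content)

verdict = judge("What is the capital of France?", "Paris.")
print(verdict["score"], verdict["reason"])

In practice, judge prompts like this are usually calibrated against a small set of human ratings before being trusted, which is exactly the kind of meta-evaluation several of the papers above study.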
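The RAG-evaluation examples above pair judge scores of this kind with cheap, deterministic information-retrieval metrics. Below is a self-contained sketch of two standard metrics, recall@k and MRR, over document IDs; the example data is made up for illustration:

# Classic retrieval metrics often used alongside LLM-as-a-judge checks for RAG systems
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc3", "doc7", "doc1", "doc9"]  # hypothetical ranking from a retriever
relevant = {"doc1", "doc4"}                   # hypothetical ground-truth labels
print(recall_at_k(retrieved, relevant, k=3))  # 0.5  (one of two relevant docs in top 3)
print(mrr(retrieved, relevant))               # 0.333... (first relevant doc at rank 3)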