Repositories under the llm-as-a-judge topic:
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluate your LLM's responses with Prometheus and GPT-4 💯
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
The repository for a survey of bias and fairness in information retrieval (IR) with LLMs.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
A set of tools for generating synthetic data from documents
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Harnessing Large Language Models for Curated Code Reviews
MCP as a Judge is a behavioral MCP (Model Context Protocol) server that strengthens AI coding assistants by requiring explicit LLM evaluations
MCP server for the Root Signals Evaluation Platform
A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques (a small retrieval-metric sketch follows this list)
LLM-as-judge evals as Semantic Kernel Plugins
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
LLM-as-a-judge for Extractive QA datasets
Explore techniques to use small models as jailbreaking judges
Controversial Questions for Argumentation and Retrieval
A multi-agent systems framework written in Rust. Domain agents (specialists) can use tools; workflow agents can load or define a workflow and monitor its execution. LLM-as-a-judge is used for evaluation, and a Discovery Service and Memory Service support agent interactions.
The code for the ACL 2025 paper "RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation for High-End Models"
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern (a minimal sketch of the judge pattern follows this list).
A lightweight library for LLM-as-a-judge evaluations on vLLM-hosted models.
Official implementation for the paper "Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation"
The official repository for our ACL 2025 paper: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
Replication package for PROBE-SWE: a dynamic benchmark to generate, validate, and analyze data-induced cognitive biases in general-purpose AI (GPAI) on typical software-engineering dilemmas.
An intelligent chatbot that provides information about courses, exams, services, and procedures of the Catholic University, built with RAG (Retrieval-Augmented Generation)
What if AI models were judging your performance review or resume? This system reveals the hidden biases and preferences of AI judges by running competitive tournaments between different writing styles and optimization strategies.
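Several of the entries above (for example the judge-pattern notebooks and the vLLM judging library) revolve around the same basic loop: send a question/answer pair plus a scoring rubric to a judge model and parse a structured verdict. The following is a minimal, illustrative sketch of that loop using the OpenAI Python client; the model name, rubric wording, and 1-5 score scale are assumptions, and it does not reflect how any specific repository listed here is implemented:

# Minimal LLM-as-a-judge sketch (illustrative only; model name and rubric are assumptions)
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are an evaluation judge. Rate the ANSWER to the QUESTION on a 1-5 "
    "scale for correctness and helpfulness. Reply with JSON only: "
    '{"score": <int 1-5>, "reason": "<one sentence>"}'
)

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Score one question/answer pair with a judge model and return the parsed verdict."""
    response = client.chat.completions.create(
        model=model,                              # assumed judge model; swap in your own
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
        temperature=0,                            # keep verdicts as reproducible as possible
        response_format={"type": "json_object"},  # request machine-parseable output
    )
    return json.loads(response.choices[0].message.content)

verdict = judge("What is the capital of France?", "Paris.")
print(verdict["score"], verdict["reason"])

In practice, judge prompts like this are usually calibrated against a small set of human ratings before being trusted, which is exactly the kind of meta-evaluation several of the papers above study.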
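The RAG-evaluation examples above pair judge scores of this kind with cheap, deterministic information-retrieval metrics. Below is a self-contained sketch of two standard metrics, recall@k and MRR, over document IDs; the example data is made up for illustration:

# Classic retrieval metrics often used alongside LLM-as-a-judge checks for RAG systems
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc3", "doc7", "doc1", "doc9"]  # hypothetical ranking from a retriever
relevant = {"doc1", "doc4"}                   # hypothetical ground-truth labels
print(recall_at_k(retrieved, relevant, k=3))  # 0.5  (one of two relevant docs in top 3)
print(mrr(retrieved, relevant))               # 0.333... (first relevant doc at rank 3)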