There are 29 repositories under the llm-eval topic.
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
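For illustration, here is a rough sketch of the kind of declarative eval config such a tool consumes, written as a Python dict dumped to YAML. The keys, provider names, prompt, and assertion type are assumptions for the sake of the example, not copied from the project's documentation.

```python
# Sketch of a declarative eval config, expressed as a Python dict and
# dumped to YAML. Keys and provider names are illustrative assumptions.
import yaml  # pip install pyyaml

config = {
    "prompts": ["Summarize the following text in one sentence: {{text}}"],
    "providers": ["openai:gpt-4o-mini", "anthropic:claude-3-5-sonnet"],
    "tests": [
        {
            "vars": {"text": "LLM evaluation compares model outputs against expectations."},
            "assert": [{"type": "contains", "value": "evaluation"}],
        }
    ],
}

with open("eval-config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

A CLI-driven tool can then read a file like this in a local run or a CI job and report pass/fail per assertion.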
🐢 Open-Source Evaluation & Testing library for LLM Agents
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.
A desktop MCP client designed as a unified utility for tool integration, accelerating AI adoption through the Model Context Protocol (MCP) and enabling cross-vendor LLM API orchestration.
Python SDK for running evaluations on LLM-generated responses
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
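As a rough sketch of the LLM-as-judge pattern such a tool builds on (not the tool's own API), a GPT model can be asked to grade a response along several aspects and return interpretable scores. The aspect names, prompt wording, and model name below are assumptions.

```python
# Sketch of multi-aspect, GPT-based grading (LLM-as-judge). The aspects,
# prompt wording, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def grade(question: str, answer: str) -> dict:
    aspects = ["correctness", "relevance", "clarity"]
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        f"Rate the answer from 1 to 5 on each of {aspects}. "
        'Reply with JSON only, e.g. {"correctness": 4, "relevance": 5, "clarity": 3}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

print(grade("What is 2 + 2?", "2 + 2 equals 4."))
```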
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
An open source library for asynchronous querying of LLM endpoints
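A minimal sketch of the asynchronous-querying pattern itself, using httpx and asyncio directly rather than the library's own API; the endpoint URL, payload shape, and response field are assumptions.

```python
# Sketch of fanning out prompts to an LLM HTTP endpoint concurrently.
# The URL, payload shape, and response field are illustrative assumptions.
import asyncio
import httpx

API_URL = "https://api.example.com/v1/completions"  # hypothetical endpoint

async def query(client: httpx.AsyncClient, prompt: str) -> str:
    resp = await client.post(API_URL, json={"prompt": prompt, "max_tokens": 64})
    resp.raise_for_status()
    return resp.json()["text"]

async def main(prompts: list[str]) -> list[str]:
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(query(client, p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(main(["Hello", "Summarize LLM evaluation in one line"])))
```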
LLM Security Platform.
An open-source project for comparing two LLMs head to head on a given prompt. This repository covers the backend, which allows LLM APIs to be incorporated and consumed by the front end.
Realign is a testing and simulation framework for AI applications.
Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.
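The repository does this in a single Bash script; as a rough Python equivalent of the idea, Ollama's local REST API can list the installed models and run the prompt against each one. The /api/tags and /api/generate endpoints are part of Ollama's local API; treat the rest as an illustrative assumption.

```python
# Sketch: run one prompt against every locally installed Ollama model via its
# REST API, printing each model's output and rough timing.
import time
import requests

OLLAMA = "http://localhost:11434"
prompt = "Explain LLM evaluation in one sentence."

models = [m["name"] for m in requests.get(f"{OLLAMA}/api/tags").json()["models"]]
for model in models:
    start = time.time()
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    print(f"== {model} ({time.time() - start:.1f}s)\n{resp['response']}\n")
```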
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
The prompt engineering, prompt management, and prompt evaluation tool for Python
[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
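For instance, an evaluation can live directly in a pytest suite so it runs on every CI build. This is a sketch under assumed names, not any particular framework's API: `generate_answer` stands in for your app's real LLM call, and the keyword scorer and threshold are assumptions.

```python
# Sketch of wiring an LLM evaluation into a pytest suite. `generate_answer`
# is a placeholder for the application's LLM call; the scorer and threshold
# are illustrative assumptions.
import pytest

def generate_answer(question: str) -> str:
    # Placeholder for the application's real LLM call.
    return "Paris is the capital of France."

def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)

@pytest.mark.parametrize(
    "question, keywords",
    [
        ("What is the capital of France?", ["Paris"]),
        ("Name the capital of France and its country.", ["Paris", "France"]),
    ],
)
def test_answer_quality(question, keywords):
    score = keyword_score(generate_answer(question), keywords)
    assert score >= 0.8, f"score {score:.2f} below threshold for: {question}"
```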
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
Generative agents: computational software agents that simulate believable human behavior using OpenAI LLMs. Our main focus was to develop a game, “Werewolves of Miller’s Hollow”, aiming to replicate human-like behavior.
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Community Plugin for Genkit to use Promptfoo
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
LLM Security Platform Docs
Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in a grid world. This project aims to replicate a paper published in 2025 (arXiv:2508.12920).