llm-evaluation

There are 40 repositories under llm-evaluation topic.

mlflow
mlflow / mlflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering
Language:Python 22849
langfuse
langfuse / langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation llm-observability llmops monitoring observability open-source openai playground prompt-engineering prompt-management self-hosted ycombinator
Language:TypeScript 18059
comet-ml / opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
hacktoberfest hacktoberfest2025 langchain llama-index llm llm-evaluation llm-observability llmops open-source openai playground prompt-engineering
Language:Python 15501
confident-ai / deepeval
The LLM Evaluation Framework
evaluation-framework evaluation-metrics hacktoberfest llm-evaluation llm-evaluation-framework llm-evaluation-metrics python
Language:Python 11992
promptfoo / promptfoo
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
llm prompt-engineering prompts llmops prompt-testing testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework ci ci-cd cicd pentesting red-teaming vulnerability-scanners
Language:TypeScript 8997
phoenix
Arize-ai / phoenix
AI Observability & Evaluation
llmops ai-monitoring ai-observability llm-eval aiengineering datasets agents llms prompt-engineering anthropic evals llm-evaluation openai langchain llamaindex smolagents
Language:Jupyter Notebook 7628
NVIDIA / garak
the LLM vulnerability scanner
ai llm-evaluation llm-security security-scanners vulnerability-assessment
Language:Python 6324
jeinlee1991 / chinese-llm-benchmark
ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括303个大模型，覆盖chatgpt、gpt-5、o4-mini、谷歌gemini-2.5、Claude4.5、智谱GLM-Z1、文心一言、qwen3-max、百川、讯飞星火、商汤senseChat、minimax等商用模型，以及kimi-k2、ernie4.5、minimax-M1、DeepSeek-R1-0528、deepseek-v3.2、qwen3-2507、llama4、GLM4.5、gemma3、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。
agentic-ai artificial-intelligence llm-agent llm-evaluation
5083
giskard-oss
Giskard-AI / giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents
agent-evaluation ai-red-team ai-security ai-testing fairness-ai llm llm-eval llm-evaluation llm-security llmops ml-testing ml-validation mlops rag-evaluation red-team-tools responsible-ai trustworthy-ai
Language:Python 4964
Helicone / helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
large-language-models prompt-engineering agent-monitoring analytics evaluation gpt langchain llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring open-source openai playground prompt-management ycombinator
Language:TypeScript 4698
AutoRAG
Marker-Inc-Korea / AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation
Language:Python 4388
PacktPublishing / LLM-Engineers-Handbook
The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
aws fine-tuning-llm genai llm llm-evaluation llmops ml-system-design mlops rag
Language:Python 4349
agenta
Agenta-AI / agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
llm-as-a-judge llm-evaluation llm-framework llm-monitoring llm-observability llm-platform llm-playground llm-tools llmops-platform prompt-engineering prompt-management rag-evaluation
Language:Python 3201
truera / trulens
Evaluation and Tracking for LLM Experiments and AI Agents
agent-evaluation agentops ai-agents ai-monitoring ai-observability evals explainable-ml llm-eval llm-evaluation llmops llms machine-learning neural-networks
Language:Python 2900
lmnr-ai / lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source rust rust-lang self-hosted ts typescript
Language:TypeScript 2392
agentic_security
msoedov / agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
llm-guardrails llm-security llm-jailbreaks llm-scanner llm-vulnerabilities llm-fuzzer llm-fuzzing llm-fuzzer-aggregator ai-red-team llm-evaluation llm-evaluation-framework prompt-testing agent-security agent-framework
Language:Python 1670
huggingface / aisheets
Build, enrich, and transform datasets using AI models with no code
ai llm-evaluation llms nocode oss synthetic-data
Language:TypeScript 1558
genieincodebottle / generative-ai
Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.
generative-ai agentic-ai claude gemini genai genai-usecase interview-questions langchain langgraph large-language-model llama3 llm-agent llm-evaluation mcp model-context-protocol multimodal n8n n8n-workflow openai-api retrieval-augmented-generation
Language:Jupyter Notebook 1547
microsoft / prompty
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
generative-ai llm-evaluation llms promptengineering prompty
Language:Python 1078
cvs-health / uqlm
UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection
ai-evaluation ai-safety hallucination hallucination-detection hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification confidence-estimation confidence-score
Language:Python 1059
judgeval
JudgmentLabs / judgeval
The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.
agent agentic-ai agents grpo langchain langgraph llama-index llm llm-evaluation llm-observability open-source openai prompt-engineering reinforcement-learning rl
Language:Python 1013
FuzzyAI
cyberark / FuzzyAI
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
jailbreak jailbreaking llm llms ai security fuzzing llm-evaluation llm-security ai-red-team
Language:Jupyter Notebook 853
onejune2018 / Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs. 一个由工具、基准/数据、演示、排行榜和大模型等组成的精选列表，主要面向基础大模型评测，旨在探求生成式AI的技术边界.
awsome-list awsome-lists benchmark bert chatglm chatgpt dataset evaluation gpt3 large-language-model leaderboard llama llm llm-evaluation machine-learning nlp openai qwen rag
578
ValueByte-AI / Awesome-LLM-in-Social-Science
Awesome papers involving LLMs in Social Science.
alignment economics large-language-models llm-agent llm-evaluation llms policy psychology simulation-environment social-network social-science
550
relari-ai / continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
evaluation-framework evaluation-metrics information-retrieval llm-evaluation llmops rag retrieval-augmented-generation
Language:Python 510
palico-ai / palico-ai
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
langchain langchain-js llamaindex llm llm-agent llm-evaluation llm-framework portkey rag javascript typescript llm-observability autogen anthropic openai ai nodejs full-stack docker
Language:TypeScript 341
JonathanChavezTamales / llm-leaderboard
A comprehensive set of LLM benchmark scores and provider prices.
llm llm-agents llm-evaluation llmops llms-benchmarking
Language:JavaScript 313
athina-ai / athina-evals
Python SDK for running evaluations on LLM generated responses
evaluation evaluation-framework evaluation-metrics llm-eval llm-evaluation llm-evaluation-toolkit llm-ops llmops
Language:Python 292
PetroIvaniuk / llms-tools
A list of LLMs Tools & Projects
ai chat-bot chatbots chatgpt data-science llm llm-evaluation machine-learning open-source-llm
283
iMeanAI / WebCanvas
All-in-one Web Agent framework for post-training. Start building with a few clicks!
agent benchmark-framework llm-agent llm-evaluation
Language:Python 273
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
benchmark benchmark-mixture benchmarking-framework benchmarking-suite evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference mixeval
Language:Python 246
cvs-health / langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
ai artificial-intelligence bias bias-detection ethical-ai fairness fairness-ai fairness-ml fairness-testing large-language-models llm responsible-ai python ai-safety llm-evaluation llm-evaluation-framework llm-evaluation-metrics
Language:Python 240
HiThink-Research / BizFinBench
A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
benchmark finance llm llm-benchmarking llm-evaluation
Language:Python 208
alopatenko / LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
evaluation generative-ai-benchmarking llm llm-benchmarking llm-evaluation
Language:HTML 141
fuxiAIlab / CivAgent
CivAgent is an LLM-based Human-like Agent acting as a Digital Player within the Strategy Game Unciv.
aiagent game llm-agent llm-evaluation
Language:Python 132
open-compass / GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
llm-agent llm-evaluation
Language:Python 128

llm-evaluation

mlflow / mlflow

langfuse / langfuse

comet-ml / opik

confident-ai / deepeval

promptfoo / promptfoo

Arize-ai / phoenix

NVIDIA / garak

jeinlee1991 / chinese-llm-benchmark

Giskard-AI / giskard-oss

Helicone / helicone

Marker-Inc-Korea / AutoRAG

PacktPublishing / LLM-Engineers-Handbook

Agenta-AI / agenta

truera / trulens

lmnr-ai / lmnr

msoedov / agentic_security

huggingface / aisheets

genieincodebottle / generative-ai

microsoft / prompty

cvs-health / uqlm

JudgmentLabs / judgeval

cyberark / FuzzyAI

onejune2018 / Awesome-LLM-Eval

ValueByte-AI / Awesome-LLM-in-Social-Science

relari-ai / continuous-eval

palico-ai / palico-ai

JonathanChavezTamales / llm-leaderboard

athina-ai / athina-evals

PetroIvaniuk / llms-tools

iMeanAI / WebCanvas

JinjieNi / MixEval

cvs-health / langfair

HiThink-Research / BizFinBench

alopatenko / LLMEvaluation

fuxiAIlab / CivAgent

open-compass / GTA