There are 40 repositories under llm-evaluation topic.
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
The LLM Evaluation Framework
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
ReLE评测:中文AI大模型能力评测(持续更新):目前已囊括303个大模型,覆盖chatgpt、gpt-5、o4-mini、谷歌gemini-2.5、Claude4.5、智谱GLM-Z1、文心一言、qwen3-max、百川、讯飞星火、商汤senseChat、minimax等商用模型, 以及kimi-k2、ernie4.5、minimax-M1、DeepSeek-R1-0528、deepseek-v3.2、qwen3-2507、llama4、GLM4.5、gemma3、mistral等开源大模型。不仅提供排行榜,也提供规模超200万的大模型缺陷库!方便广大社区研究分析、改进大模型。
🐢 Open-Source Evaluation & Testing library for LLM Agents
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Build, enrich, and transform datasets using AI models with no code
Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection
The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs. 一个由工具、基准/数据、演示、排行榜和大模型等组成的精选列表,主要面向基础大模型评测,旨在探求生成式AI的技术边界.
Awesome papers involving LLMs in Social Science.
Data-Driven Evaluation for LLM-Powered Applications
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
A comprehensive set of LLM benchmark scores and provider prices.
Python SDK for running evaluations on LLM generated responses
A list of LLMs Tools & Projects
All-in-one Web Agent framework for post-training. Start building with a few clicks!
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
CivAgent is an LLM-based Human-like Agent acting as a Digital Player within the Strategy Game Unciv.
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents