The following repositories are listed under the evaluation-metrics topic.
The LLM Evaluation Framework
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
"A White-Box Guide to Building Large Models" (《大模型白盒子构建指南》): Tiny-Universe, a completely hand-built implementation from scratch
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
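For reference, word error rate is the word-level edit (Levenshtein) distance between a reference transcript and a hypothesis, normalized by the reference length. A minimal sketch (function and variable names are illustrative, not this repository's API):

```python
# Minimal word error rate (WER) via dynamic programming over words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```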
:chart_with_upwards_trend: Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
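As a quick illustration of two of the listed metrics, RMSE and PSNR can be computed directly with NumPy on 8-bit images; the function names below are illustrative and not taken from this repository:

```python
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    # Root mean squared error in pixel units.
    return float(np.sqrt(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB; infinite for identical images.
    err = rmse(a, b)
    return float("inf") if err == 0 else 20.0 * np.log10(max_val / err)

img_a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
img_b = np.clip(img_a.astype(int) + np.random.randint(-5, 6, img_a.shape), 0, 255).astype(np.uint8)
print(rmse(img_a, img_b), psnr(img_a, img_b))
```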
Data-Driven Evaluation for LLM-Powered Applications
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and for building simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL), as well as clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)
[RA-L 2025] MapEval: Towards Unified, Robust and Efficient SLAM Map Evaluation Framework.
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Python SDK for running evaluations on LLM generated responses
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
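These four metrics are defined via k-nearest-neighbour radii in a feature space: precision and density ask how many generated samples fall inside real-sample neighbourhoods, while recall and coverage ask how well real samples are reached by generated ones. A compact NumPy sketch of those definitions (the repository's own implementation differs in details; names here are illustrative):

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distances between every row of a and every row of b.
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def prdc(real, fake, k=5):
    d_rr = pairwise_dist(real, real)
    d_ff = pairwise_dist(fake, fake)
    d_rf = pairwise_dist(real, fake)            # shape (N_real, N_fake)
    r_real = np.sort(d_rr, axis=1)[:, k]        # k-NN radius around each real sample (index 0 is self)
    r_fake = np.sort(d_ff, axis=1)[:, k]        # k-NN radius around each fake sample
    precision = (d_rf < r_real[:, None]).any(axis=0).mean()   # fakes inside some real ball
    recall    = (d_rf < r_fake[None, :]).any(axis=1).mean()   # reals inside some fake ball
    density   = (d_rf < r_real[:, None]).sum(axis=0).mean() / k
    coverage  = (d_rf.min(axis=1) < r_real).mean()            # reals whose ball contains a fake
    return dict(precision=precision, recall=recall, density=density, coverage=coverage)

real = np.random.randn(200, 64)
fake = np.random.randn(150, 64)
print(prdc(real, fake, k=5))
```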
A Natural Language Processing problem in which sentiment analysis is performed by classifying tweets as positive or negative using machine learning models, covering classification, text mining, text analysis, data analysis, and data visualization
A Python wrapper for the ROUGE summarization evaluation package
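ROUGE scores a candidate summary by its n-gram overlap with reference summaries. A self-contained ROUGE-1 sketch to illustrate what the wrapped package computes (this is not the wrapper's API):

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    # Unigram overlap between reference and candidate summaries.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat is on the mat"))
```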
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark performance across instance types and serving stack options.
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9: not at the tag/token level, but considering all tokens that are part of the named entity
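In the strict scheme of that evaluation, a predicted entity counts as correct only if both its span and its type match a gold entity. A reduced sketch under that assumption (the full metrics also define exact, partial, and type-only schemes; names here are illustrative):

```python
def strict_entity_f1(gold, pred):
    """gold/pred: sets of (start, end, entity_type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                         # span and type both match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(0, 2, "PER"), (5, 7, "LOC")}
pred = {(0, 2, "PER"), (5, 7, "ORG")}             # right boundary, wrong type -> not a strict match
print(strict_entity_f1(gold, pred))               # (0.5, 0.5, 0.5)
```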
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, along with video editing benchmark code.
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Python wrapper for evaluating summarization quality with the ROUGE package
A fast implementation of bss_eval metrics for blind source separation
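bss_eval decomposes each estimated source into target, interference, noise, and artifact components and reports ratios such as SDR. As a simplified illustration of the flavour of metric involved, here is a scale-invariant SDR for a single source (not the full bss_eval decomposition; names are illustrative):

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    # Scale-invariant signal-to-distortion ratio in dB.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

t = np.linspace(0, 1, 8000)
ref = np.sin(2 * np.pi * 440 * t)
est = ref + 0.05 * np.random.randn(t.size)
print(si_sdr(ref, est))
```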
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022]
GOM: a new metric for re-identification. 👉 GOM explicitly balances the effects of retrieval and verification within a single unified metric.