The following repositories are listed under the evaluation-metrics topic.
The LLM Evaluation Framework
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
"A White-Box Guide to Building Large Models" (《大模型白盒子构建指南》): Tiny-Universe, a completely hand-built implementation from scratch
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
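For reference, word error rate is the word-level edit (Levenshtein) distance between a reference transcript and a hypothesis, normalized by the reference length. A minimal sketch (function and variable names are illustrative, not this repository's API):

```python
# Minimal word error rate (WER) via dynamic programming over words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```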
:chart_with_upwards_trend: Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
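As a quick illustration of two of the listed metrics, RMSE and PSNR can be computed directly with NumPy on 8-bit images; the function names below are illustrative and not taken from this repository:

```python
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    # Root mean squared error in pixel units.
    return float(np.sqrt(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB; infinite for identical images.
    err = rmse(a, b)
    return float("inf") if err == 0 else 20.0 * np.log10(max_val / err)

img_a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
img_b = np.clip(img_a.astype(int) + np.random.randint(-5, 6, img_a.shape), 0, 255).astype(np.uint8)
print(rmse(img_a, img_b), psnr(img_a, img_b))
```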
Data-Driven Evaluation for LLM-Powered Applications
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and for building simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL), as well as clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)
[RA-L 2025] MapEval: Towards Unified, Robust and Efficient SLAM Map Evaluation Framework.
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Python SDK for running evaluations on LLM generated responses
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
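These four metrics are defined via k-nearest-neighbour radii in a feature space: precision and density ask how many generated samples fall inside real-sample neighbourhoods, while recall and coverage ask how well real samples are reached by generated ones. A compact NumPy sketch of those definitions (the repository's own implementation differs in details; names here are illustrative):

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distances between every row of a and every row of b.
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def prdc(real, fake, k=5):
    d_rr = pairwise_dist(real, real)
    d_ff = pairwise_dist(fake, fake)
    d_rf = pairwise_dist(real, fake)            # shape (N_real, N_fake)
    r_real = np.sort(d_rr, axis=1)[:, k]        # k-NN radius around each real sample (index 0 is self)
    r_fake = np.sort(d_ff, axis=1)[:, k]        # k-NN radius around each fake sample
    precision = (d_rf < r_real[:, None]).any(axis=0).mean()   # fakes inside some real ball
    recall    = (d_rf < r_fake[None, :]).any(axis=1).mean()   # reals inside some fake ball
    density   = (d_rf < r_real[:, None]).sum(axis=0).mean() / k
    coverage  = (d_rf.min(axis=1) < r_real).mean()            # reals whose ball contains a fake
    return dict(precision=precision, recall=recall, density=density, coverage=coverage)

real = np.random.randn(200, 64)
fake = np.random.randn(150, 64)
print(prdc(real, fake, k=5))
```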
A Natural Language Processing problem in which sentiment analysis is performed by classifying tweets as positive or negative using machine learning models, covering classification, text mining, text analysis, data analysis, and data visualization
A Python wrapper for the ROUGE summarization evaluation package
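ROUGE scores a candidate summary by its n-gram overlap with reference summaries. A self-contained ROUGE-1 sketch to illustrate what the wrapped package computes (this is not the wrapper's API):

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    # Unigram overlap between reference and candidate summaries.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat is on the mat"))
```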
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark performance across instance types and serving stack options.
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9: not at the tag/token level, but considering all tokens that are part of the named entity
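In the strict scheme of that evaluation, a predicted entity counts as correct only if both its span and its type match a gold entity. A reduced sketch under that assumption (the full metrics also define exact, partial, and type-only schemes; names here are illustrative):

```python
def strict_entity_f1(gold, pred):
    """gold/pred: sets of (start, end, entity_type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                         # span and type both match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(0, 2, "PER"), (5, 7, "LOC")}
pred = {(0, 2, "PER"), (5, 7, "ORG")}             # right boundary, wrong type -> not a strict match
print(strict_entity_f1(gold, pred))               # (0.5, 0.5, 0.5)
```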
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, along with video editing benchmark code.
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Python wrapper for evaluating summarization quality with the ROUGE package
A fast implementation of bss_eval metrics for blind source separation
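bss_eval decomposes each estimated source into target, interference, noise, and artifact components and reports ratios such as SDR. As a simplified illustration of the flavour of metric involved, here is a scale-invariant SDR for a single source (not the full bss_eval decomposition; names are illustrative):

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    # Scale-invariant signal-to-distortion ratio in dB.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

t = np.linspace(0, 1, 8000)
ref = np.sin(2 * np.pi * 440 * t)
est = ref + 0.05 * np.random.randn(t.size)
print(si_sdr(ref, est))
```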
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022]
GOM: a new metric for re-identification. 👉 GOM explicitly balances the effects of retrieval and verification within a single unified metric.