There are 28 repositories under the evaluation-metrics topic.
The LLM Evaluation Framework
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
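For intuition, here is a minimal sketch of the interquartile mean (IQM) aggregate that this line of work advocates for small numbers of seeds, using scipy's trimmed mean; the library's stratified bootstrap confidence intervals are out of scope here.

```python
import numpy as np
from scipy import stats

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: mean of the middle 50% of scores.

    More robust to outlier runs than the mean, more statistically
    efficient than the median when only a handful of seeds exists.
    """
    # proportiontocut=0.25 drops the lowest and highest 25% of
    # values before averaging; axis=None flattens runs x tasks.
    return stats.trim_mean(scores, proportiontocut=0.25, axis=None)

# Example: 5 seeds x 3 tasks of normalized returns.
scores = np.array([
    [0.91, 0.55, 0.72],
    [0.88, 0.60, 0.70],
    [0.10, 0.58, 0.69],  # one diverged run barely moves the IQM
    [0.93, 0.57, 0.74],
    [0.90, 0.54, 0.71],
])
print(f"IQM: {iqm(scores):.3f}  mean: {scores.mean():.3f}")
```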
:chart_with_upwards_trend: Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
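As a rough illustration of the two simplest metrics in that list, a self-contained RMSE/PSNR sketch for 8-bit images; the repository's implementations also handle multi-band imagery and the perceptual metrics.

```python
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root mean squared error between two images of equal shape."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means more similar."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Example with a random 8-bit grayscale image and a noisy copy.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
print(f"RMSE: {rmse(img, noisy):.2f}  PSNR: {psnr(img, noisy):.2f} dB")
```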
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as extracting n-grams and frequency lists, and for building a simple language model. There are also more complex data types and algorithms, as well as parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL) and clients for interfacing with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
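The n-gram and frequency-list tasks mentioned above are easy to sketch in plain Python; the snippet below is an illustration of the idea using only the standard library, not PyNLPl's own API.

```python
from collections import Counter
from typing import Iterable

def ngrams(tokens: list[str], n: int) -> Iterable[tuple[str, ...]]:
    """Yield all contiguous n-grams from a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = "the cat sat on the mat because the cat was tired".split()

# Frequency lists of unigrams and bigrams.
unigram_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))

print(unigram_freq.most_common(3))  # [('the', 3), ('cat', 2), ...]
print(bigram_freq.most_common(2))   # [(('the', 'cat'), 2), ...]
```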
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside its recently released LLM data processing library datatrove and LLM training library nanotron.
Open-Source Evaluation for GenAI Application Pipelines
Python SDK for agent evals and observability
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
A Python wrapper for the ROUGE summarization evaluation package
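ROUGE-N itself reduces to clipped n-gram overlap between candidate and reference; here is a minimal ROUGE-1 sketch for intuition, not the wrapper's API, and without ROUGE's stemming or multi-reference handling.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict[str, float]:
    """ROUGE-N via clipped n-gram overlap between candidate and reference."""
    def grams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    overlap = sum((cand & ref).values())  # intersection clips the counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}

print(rouge_n("the cat sat on the mat", "a cat sat on a mat"))
# {'p': 0.667, 'r': 0.667, 'f': 0.667} (4 of 6 unigrams overlap)
```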
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
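All four of those metrics are k-nearest-neighbor manifold tests in an embedding space; below is a compact numpy sketch of the definitions from the paper, assuming `real` and `fake` are already embedded feature arrays (the official code adds feature extraction and batching).

```python
import numpy as np

def knn_radii(x: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point in x to its k-th nearest neighbor in x."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the point itself (distance 0)

def prdc(real: np.ndarray, fake: np.ndarray, k: int = 5) -> dict[str, float]:
    """Precision, recall, density, coverage (Naeem et al., ICML 2020)."""
    r_real = knn_radii(real, k)  # per-real-sample kNN ball radius
    r_fake = knn_radii(fake, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)  # fake x real

    precision = (d <= r_real[None, :]).any(axis=1).mean()  # fakes on real manifold
    recall    = (d <= r_fake[:, None]).any(axis=0).mean()  # reals on fake manifold
    density   = (d <= r_real[None, :]).sum() / (k * len(fake))
    coverage  = (d.min(axis=0) <= r_real).mean()           # real balls hit by a fake
    return dict(precision=precision, recall=recall, density=density, coverage=coverage)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
fake = rng.normal(loc=0.3, size=(200, 8))
print(prdc(real, fake))
```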
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9 - evaluated not at the tag/token level but considering all the tokens that are part of the named entity.
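Under the strict scheme, a predicted entity counts only when both its span and its type match the gold annotation exactly; here is a toy sketch of that one scheme (SemEval'13 also defines exact, partial, and type matching, which this does not cover).

```python
# Entities as (start_token, end_token, type) triples.
gold = {(0, 1, "PER"), (5, 6, "LOC"), (9, 9, "ORG")}
pred = {(0, 1, "PER"), (5, 6, "ORG"), (8, 9, "ORG")}

correct = len(gold & pred)  # span AND type must both match
precision = correct / len(pred) if pred else 0.0
recall = correct / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"strict P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```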
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A Natural Language Processing problem in which sentiment analysis is performed by classifying positive versus negative tweets with machine learning models, covering classification, text mining, text analysis, data analysis, and data visualization.
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
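LLM-judged metrics such as faithfulness are hard to compress into a snippet, but a naive lexical proxy shows the shape of the computation; `lexical_groundedness` below is purely illustrative and not part of any library's API.

```python
import re

def lexical_groundedness(answer: str, contexts: list[str]) -> float:
    """Naive proxy: fraction of answer tokens that also appear in the
    retrieved contexts. Real RAG metrics (e.g., LLM-judged faithfulness)
    are far more nuanced; this only illustrates the computation's shape."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    answer_tokens = tokenize(answer)
    context_tokens = tokenize(" ".join(contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens) if answer_tokens else 0.0

print(lexical_groundedness(
    "Paris is the capital of France",
    ["France's capital city is Paris.", "The Seine flows through Paris."],
))  # 0.83: "of" is the only unsupported answer token
```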
A Python wrapper for evaluating summarization quality with the ROUGE package
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
A fast implementation of bss_eval metrics for blind source separation
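The quantity at the heart of bss_eval is a signal-to-distortion ratio; here is a sketch of the simpler scale-invariant variant (SI-SDR), which skips bss_eval's projection onto filtered versions of the sources.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB (Le Roux et al., 2019).

    Projects the estimate onto the reference to find the optimal
    scaling, then compares target energy to residual energy.
    """
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scale factor
    target = alpha * ref
    noise = est - target
    return float(10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise)))

rng = np.random.default_rng(0)
s = rng.normal(size=16000)                # reference source
est = s + 0.1 * rng.normal(size=16000)    # noisy estimate of the source
print(f"SI-SDR: {si_sdr(est, s):.1f} dB")
```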
Python SDK for running evaluations on LLM generated responses
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.
GOM: a new metric for re-identification. 👉 GOM explicitly balances the effects of retrieval and verification in a single unified metric.
Code for "Semantic Object Accuracy for Generative Text-to-Image Synthesis" (TPAMI 2020)
(NeurIPS 2023) Official code for "TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models"
Assessing Generative Models via Precision and Recall (official repository)
EMNLP 2021 - CTC: A Unified Framework for Evaluating Natural Language Generation
📝 Reference-Free automatic summarization evaluation with potential hallucination detection
:gift:[ChatGPT4MTevaluation] ErrorAnalysis Prompt for MT Evaluation in ChatGPT