There are 30 repositories under the evaluation topic.
🤘 awesome-semantic-segmentation
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
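As illustration, a minimal sketch of tracing a function with the Langfuse Python SDK's @observe decorator; the import path below follows the v2 SDK and should be treated as an assumption, since it differs across versions.

```python
# Hedged sketch: trace an LLM-calling function with Langfuse.
# Import path matches the v2 Python SDK; it may differ in other versions.
from langfuse.decorators import observe

@observe()  # records inputs, outputs, and timing as a trace in Langfuse
def answer(question: str) -> str:
    # ... call your LLM provider here ...
    return "42"

answer("What is 6 * 7?")
```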
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Python package for the evaluation of odometry and SLAM
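As a sketch, computing absolute pose error with evo's Python API; the file names are placeholders.

```python
# Compare an estimated trajectory against ground truth with evo.
from evo.core import metrics, sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("reference.txt")  # ground truth
traj_est = file_interface.read_tum_trajectory_file("estimate.txt")   # SLAM output
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)

# Absolute pose error on the translation part, reported as RMSE.
ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print(ape.get_statistic(metrics.StatisticsType.rmse))
```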
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
SuperCLUE: a comprehensive benchmark for Chinese general-purpose large models | A Benchmark for Foundation Models in Chinese
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
A unified evaluation framework for large language models
An open-source visual programming environment for battle-testing prompts to LLMs.
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
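🤗 Evaluate exposes metrics through a uniform load/compute interface; a minimal example:

```python
import evaluate

# Load a metric by name and compute it over predictions vs. references.
accuracy = evaluate.load("accuracy")
results = accuracy.compute(
    references=[0, 1, 1, 0],   # ground-truth labels
    predictions=[0, 1, 0, 0],  # model outputs
)
print(results)  # {'accuracy': 0.75}
```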
☁️ 🚀 📊 📈 Evaluating state of the art in AI
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
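A rough sketch of Avalanche's train/eval loop over a stream of experiences; the module paths follow recent releases and are an assumption.

```python
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.supervised import Naive

# Split MNIST into 5 sequential tasks ("experiences").
benchmark = SplitMNIST(n_experiences=5)
model = SimpleMLP(num_classes=benchmark.n_classes)
strategy = Naive(
    model,
    torch.optim.SGD(model.parameters(), lr=0.01),
    torch.nn.CrossEntropyLoss(),
    train_mb_size=32, train_epochs=1, eval_mb_size=32,
)

# Train on each experience in turn, evaluating on the full test stream.
for experience in benchmark.train_stream:
    strategy.train(experience)
    strategy.eval(benchmark.test_stream)
```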
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Multi-class confusion matrix library in Python
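This appears to be PyCM; under that assumption, a short sketch of its ConfusionMatrix API:

```python
from pycm import ConfusionMatrix

actual = [2, 0, 2, 2, 0, 1]
predicted = [0, 0, 2, 2, 0, 2]

# Build the multi-class confusion matrix from label vectors.
cm = ConfusionMatrix(actual_vector=actual, predict_vector=predicted)
print(cm.Overall_ACC)  # overall accuracy (here 4/6)
cm.print_matrix()      # per-class counts
```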
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
The production toolkit for LLMs. Observability, prompt management and evaluations.
High-fidelity performance metrics for generative models in PyTorch
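Assuming this refers to torch-fidelity, metrics are computed in a single call over two image sources; the directory paths below are placeholders.

```python
import torch_fidelity

metrics = torch_fidelity.calculate_metrics(
    input1="generated_images/",  # directory of generated samples
    input2="real_images/",       # directory of reference images
    isc=True,   # Inception Score
    fid=True,   # Frechet Inception Distance
    kid=True,   # Kernel Inception Distance
)
print(metrics)
```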
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 30+ benchmarks
Expression evaluation in golang
Chinese medical information processing benchmark CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Evaluate your LLM's responses with Prometheus and GPT-4 💯
Python implementation of the IOU Tracker
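The tracker's core rule is plain bounding-box overlap; a self-contained sketch of the IoU computation (illustrative, not the repo's code):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection extends a track when its IoU with the track's last box
# exceeds a threshold (e.g. 0.5); otherwise a new track is started.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```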
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
A Simple Math and Pseudo-C# Expression Evaluator in One C# File. Can also execute small C#-like scripts.
AutoPrompt: Automatic Prompt Construction for Masked Language Models.