There are 32 repositories under the llms-benchmarking topic.
🔥 A list of tools, frameworks, and resources for building AI web agents
An extensible benchmark for evaluating large language models on planning
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
A comprehensive set of LLM benchmark scores and provider prices.
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
How good are LLMs at chemistry?
Language Model for Mainframe Modernization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates. A minimal sketch of this selection protocol appears after the list.
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Training and Benchmarking LLMs for Code Preference.
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Restore safety in fine-tuned language models through task arithmetic
A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Plancraft is a Minecraft environment and agent suite for testing planning capabilities in LLMs
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
Join 15k builders in the Real-World ML Newsletter
FaceXBench: Evaluating Multimodal LLMs on Face Understanding
A platform for private benchmarking of machine learning models, supporting evaluation under different trust levels between model owners and dataset owners.
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. This repository is actively maintained, and new features are continuously being added.
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"
Benchmarking LLMs (large language models) on leetcode algorithmic challenges
OpenPromptBank is an AI prompt library platform where users can explore, rank, and contribute AI prompts categorized by various topics. This platform features a searchable library, community-driven rankings, prompt performance benchmarks, and user profiles.
This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.
Code and data for our IROS paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
GenderBench - Evaluation suite for gender biases in LLMs
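The Thematic Generalization Benchmark entry above describes a simple selection protocol: given a few examples and anti-examples of a hidden theme, a model must pick the one candidate that truly fits it. The sketch below illustrates that protocol with a toy scorer standing in for an LLM call; `evaluate_item`, `toy_scorer`, and the sample data are hypothetical placeholders, not the benchmark's actual code or API.

```python
# Minimal sketch of a themed-selection evaluation, assuming a scoring
# function that rates how well a candidate matches the hidden theme.
from typing import Callable, Sequence


def evaluate_item(
    examples: Sequence[str],
    anti_examples: Sequence[str],
    candidates: Sequence[str],
    correct_index: int,
    score_candidate: Callable[[Sequence[str], Sequence[str], str], float],
) -> bool:
    """Return True if the highest-scoring candidate is the correct one."""
    scores = [score_candidate(examples, anti_examples, c) for c in candidates]
    return scores.index(max(scores)) == correct_index


def toy_scorer(examples, anti_examples, candidate):
    """Toy stand-in for an LLM judgment: reward word overlap with the
    positive examples and penalize overlap with the anti-examples."""
    words = set(candidate.lower().split())
    pos = sum(len(words & set(e.lower().split())) for e in examples)
    neg = sum(len(words & set(a.lower().split())) for a in anti_examples)
    return pos - neg


if __name__ == "__main__":
    hit = evaluate_item(
        examples=["red apple", "red rose"],
        anti_examples=["blue sky"],
        candidates=["red brick", "blue car", "green leaf"],
        correct_index=0,
        score_candidate=toy_scorer,
    )
    print("correct" if hit else "incorrect")
```

In an actual benchmark run, `toy_scorer` would be replaced by a model query, and accuracy would be averaged over many such items.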