Repositories under the llms-benchmarking topic:
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
How good are LLMs at chemistry?
Language Model for Mainframe Modernization
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Restore safety in fine-tuned language models through task arithmetic
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Join 15k builders in the Real-World ML Newsletter ⬇️⬇️⬇️
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. This repository is actively maintained, and new features are continuously being added.
This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
Code and data for our IROS paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
Needle-in-a-haystack long-context recall testing for LLMs (a minimal sketch of the idea appears after this list).
LLM benchmarks play a crucial role in assessing the performance of large language models (LLMs), but it is essential to recognize that these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.
Demo showcase highlighting the capabilities of Guardrails in LLMs.
Test results of Kanarya and Trendyol models with and without fine-tuning techniques on the Turkish tweet hate speech detection dataset.
Evaluate open-source language models on agent, formatted output, instruction following, long text, multilingual, coding, and custom task capabilities.
Part of our final year project work involving complex NLP tasks along with experimentation on various datasets and different LLMs
A framework using the game Balderdash to evaluate creativity and logical reasoning in Large Language Models (LLMs). Multiple LLMs generate fictitious definitions to deceive others and try to identify the correct ones; the framework analyzes their creativity, deception, and performance.
A collection of top artificial intelligence projects based on real-world use cases. 😃 Why wait more when you have everything in one place? 😎
BlogCraft is a web application built with Streamlit that leverages AI to assist in crafting blog posts effortlessly.
Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative multi-agent debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning (see the debate sketch after this list).
Vital Image Analytics is an AI-powered application designed to assist healthcare professionals in analyzing medical images for diagnostic purposes.
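The "needle in a haystack" entry above refers to a common long-context evaluation: a small fact (the needle) is buried at a chosen depth inside filler text (the haystack), and the model is asked to retrieve it. Below is a minimal sketch of that idea, not the repository's actual code; `query_model(prompt)` is a hypothetical stand-in for whatever LLM client you use.

```python
NEEDLE = "The secret passphrase is 'violet-otter-42'."
QUESTION = "What is the secret passphrase mentioned in the document?"

def build_haystack(filler_sentences: list[str], depth: float, context_len: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    filler = filler_sentences * (context_len // max(len(filler_sentences), 1) + 1)
    filler = filler[:context_len]
    insert_at = int(depth * len(filler))
    return " ".join(filler[:insert_at] + [NEEDLE] + filler[insert_at:])

def needle_recall(query_model, filler_sentences, depths=(0.0, 0.25, 0.5, 0.75, 1.0), context_len=500):
    """Return, per insertion depth, whether the needle was recovered verbatim."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler_sentences, depth, context_len) + "\n\n" + QUESTION
        answer = query_model(prompt)  # hypothetical LLM call
        results[depth] = "violet-otter-42" in answer
    return results
```

Sweeping both insertion depth and context length and plotting the recall grid is the usual way results of this test are reported.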
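The multi-agent debate entry above follows a widely used pattern: several model instances answer a question independently, each then sees the others' answers and revises its own over a few rounds, and a final answer is picked by majority vote. A minimal sketch under those assumptions (it may differ from the project's exact architecture), again using a hypothetical `query_model(prompt)` client:

```python
from collections import Counter

def multi_agent_debate(query_model, question: str, n_agents: int = 3, rounds: int = 2) -> str:
    """Simple debate loop: independent answers, revision rounds, then majority vote."""
    # Round 0: each agent answers independently.
    answers = [query_model(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(rounds):
        revised = []
        for i, own in enumerate(answers):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer: {own}\n"
                f"Other agents answered:\n{others}\n"
                "Considering the other answers, give your final concise answer."
            )
            revised.append(query_model(prompt))
        answers = revised
    # Majority vote over the final-round answers.
    return Counter(answers).most_common(1)[0][0]
```

In practice, answers are usually normalized (e.g., extracting just the final number for math problems) before voting, since free-form strings rarely match exactly.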