llms-benchmarking

There are 0 repository under llms-benchmarking topic.

ChemFoundationModels / ChemLLMBench
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
ai4science benchmark chemistry large-language-models llm llms-benchmarking nlp
Language:Jupyter Notebook 114
parea-ai / parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
llm llm-evaluation llm-tools llmops llms-benchmarking llm-eval llm-evaluation-framework llm-evaluation-toolkit prompt-engineering generative-ai good-first-issue metrics
Language:Python 56
lamalab-org / chem-bench
How good are LLMs at chemistry?
benchmark chemistry llm llms llms-benchmarking machine-learning materials-science safety
Language:Jupyter Notebook 45
FSoft-AI4Code / XMainframe
Language Model for Mainframe Modernization
cobol code-summarization codellm llms-benchmarking mainframe migration
Language:Python 36
epfl-dlab / cc_flows
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
agents ai competitive-coding competitive-programming competitive-programming-contests llms llms-benchmarking llms-reasoning aiflows
Language:Python 31
RaptorMai / CompBench
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
benchmark human-annotation large-language-models multimodal-deep-learning multimodal-large-language-models reasoning evaluation-llms foundation-models llms llms-benchmarking vision-and-language vision-language-model
Language:Jupyter Notebook 28
declare-lab / resta
Restore safety in fine-tuned language models through task arithmetic
alignment alignment-algorithm llm llm-safety llm-safety-benchmark llms llms-benchmarking safety
Language:Python 24
minnesotanlp / cobbler
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
bias evaluation llm nlp bias-detection llm-as-a-judge llm-as-evaluator llm-as-judge llm-evaluation llms llms-benchmarking
Language:Jupyter Notebook 14
Paulescu / text-embedding-evaluation
Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️
embeddings llms llms-benchmarking machine-learning
Language:Python 14
logikon-ai / cot-eval
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
chain-of-thought gen-ai leaderboard llm llms-benchmarking llms-reasoning
Language:Jupyter Notebook 7
nachoDRT / MERIT-Dataset
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, where we are actively working. This repository is actively maintained, and new features are continuously being added.
biases layoutlm layoutlmv2 layoutlmv3 layoutxlm llms-benchmarking synthetic-dataset synthetic-dataset-generation token-classification
Language:Python 6
microsoft / MEGAVERSE
Official Codebase for MEGAVERSE: (published in ACL: NAACL 2024)
llms-benchmarking multilingual
Language:Python 5
dippatel1994 / Large-Language-Models-Evaluation-Benchmarks-Collection
This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.
benchmarks large-language-models llm llms llms-benchmarking
4
parea-ai / parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
llm llm-evaluation llm-evaluation-framework llm-evaluation-toolkit llm-tools llms llms-benchmarking llm-eval prompt-engineering
Language:TypeScript 4
microsoft / private-benchmarking
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
benchmarking inference llms-benchmarking mpc private private-benchmarking secure ezpc large-language-models contamination platform confidential-computing trusted-execution-environment
Language:Python 3
PrincySinghal / Html-code-generation-from-LLM
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
code-generation fine-tuning-llm machine-learning llms llms-benchmarking
Language:Jupyter Notebook 3
stair-lab / melt
Evaluation of Language Models in Non-English Languages
llm-evaluation-framework llms-benchmarking
Language:Python 3
lwachowiak / LLMs-for-Social-Robotics
Code and data for our IROS paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
alignment hri llms llms-benchmarking social-robotics value-alignment vlm
Language:Jupyter Notebook 2
melvinebenezer / Liah-Lie_in_a_haystack
needle in a haystack for LLMs
llms-benchmarking long-context needle-in-haystack llm llm-inference
Language:Python 2
s2e-lab / RegexEval
Source code for the accepted paper in ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.
code-generation regex redos-checker redos-detector llms-benchmarking benchmark-framework
Language:Python 2
aflah02 / Humans-v-s-LLM-Benchmarks
LLM Benchmarks play a crucial role in assessing the performance of Language Model Models (LLMs). However, it is essential to recognize that these benchmarks have their own limitations. This interactive tool is designed to engage users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them
llms llms-benchmarking streamlit
Language:Python 1
Chotom / guardrails-dss-ml-2024
Demo showcase highlighting the capabilities of Guardrails in LLMs.
guardrails llms llms-benchmarking
Language:Python 1
erdemormann / kanarya-and-trendyol-classification-tests
Test results of Kanarya and Trendyol models with and without fine-tuning techniques on the Turkish tweet hate speech detection dataset.
classification hate-speech-detection llms llms-benchmarking tweets python fine-tuning peft-fine-tuning-llm feature-engineering
Language:Jupyter Notebook 1
EvilPsyCHo / Open-LLM-Benchmark
Evaluate open-source language models on Agent, formatted output, command following, long text, multilingual, coding, and custom task capabilities. 开源语言模型在Agent，格式化输出，指令追随，长文本，多语言，代码，自定义任务的能力基准测试。
evaluation-framework huggingface large-language-models llamacpp llm-agent llms-benchmarking openai vllm
Language:Python 1
dinesh-kumar-mr / MediVQA
Part of our final year project work involving complex NLP tasks along with experimentation on various datasets and different LLMs
llms llms-benchmarking medical-application vqa vqa-dataset vqa-med-2018
Language:HTML 0
ParsaHejabi / Simulation-Framework-for-Multi-Agent-Balderdash
A framework using the game Balderdash to evaluate creativity and logical reasoning in Large Language Models (LLMs). Multiple LLMs generate fictitious definitions to deceive others and identify correct ones, analyzing creativity, deception, and performance.
balderdash large-language-models llms llms-benchmarking multi-agent-simulation natural-language-processing nlp
Language:Jupyter Notebook 0
saqib727 / Artifical-Intelligence-Projects
You Can see The Top Artificial Intelligence Projects Based on Real Use cases. 😃 Why wait More when you have all things at one place. 😎
ai disease-prediction fakenews fraudsensor handwritten-digit-recognition heartdisease llms-benchmarking llmsecurity machine-learning machine-translation stockprediction
Language:Jupyter Notebook 0
saqib727 / Blog-assistant
BlogCraft is a web application built with Streamlit that leverages AI to assist in crafting blog posts effortlessly.
blogging llm llms-benchmarking
Language:Python 0
amit-sarker / ICL-Analysis-NLP-685
arithemtic-tasks btlms cerebras huggingface in-context-learning llama2 llms-benchmarking mamba-state-space-models mistral-7b sentiment-analysis
Language:Python
Santhoshi-Ravi / minerva
Evaluating and enhancing Large Language Models (LLMs) using mathematical datasets through innovative Multi-Agent Debate Architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.
llm llms-benchmarking multi-agent-debate
Language:Jupyter Notebook
saqib727 / Medical-Analyst
Vital Image Analytics is an AI-powered application designed to assist healthcare professionals in analyzing medical images for diagnostic purposes.
gemini gemini-api llms llms-benchmarking medical-application medical-device-false medical-image-analysis medical-image-processing medical-image-segmentation medical-images medical-imaging medical-informatics
Language:Python
SharathHebbar / eval_llms
eleutherai llm-evaluation llms-benchmarking
Language:Jupyter Notebook

llms-benchmarking

ChemFoundationModels / ChemLLMBench

parea-ai / parea-sdk-py

lamalab-org / chem-bench

FSoft-AI4Code / XMainframe

epfl-dlab / cc_flows

RaptorMai / CompBench

declare-lab / resta

minnesotanlp / cobbler

Paulescu / text-embedding-evaluation

logikon-ai / cot-eval

nachoDRT / MERIT-Dataset

microsoft / MEGAVERSE

dippatel1994 / Large-Language-Models-Evaluation-Benchmarks-Collection

parea-ai / parea-sdk-ts

microsoft / private-benchmarking

PrincySinghal / Html-code-generation-from-LLM

stair-lab / melt

lwachowiak / LLMs-for-Social-Robotics

melvinebenezer / Liah-Lie_in_a_haystack

s2e-lab / RegexEval

aflah02 / Humans-v-s-LLM-Benchmarks

Chotom / guardrails-dss-ml-2024

erdemormann / kanarya-and-trendyol-classification-tests

EvilPsyCHo / Open-LLM-Benchmark

dinesh-kumar-mr / MediVQA

ParsaHejabi / Simulation-Framework-for-Multi-Agent-Balderdash

saqib727 / Artifical-Intelligence-Projects

saqib727 / Blog-assistant

amit-sarker / ICL-Analysis-NLP-685

Santhoshi-Ravi / minerva

saqib727 / Medical-Analyst

SharathHebbar / eval_llms