There are 4 repositories under the evaluation-llms topic.
Code, datasets, and models for the paper "Automatic Evaluation of Attribution by Large Language Models".
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) using 40K image pairs with accompanying questions, spanning 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.