There are 4 repositories under the evaluation-llms topic.
Code, datasets, and models for the paper "Automatic Evaluation of Attribution by Large Language Models".
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) using 40K image pairs with accompanying questions, spanning 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.