MM-JudgeBias: A Benchmark for Evaluating Compositional Bias in Multimodal Large Language Model Judges
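Source: arxiv.org
The research team proposes MM-JudgeBias, a benchmark for evaluating compositional bias in MLLM-as-a-Judge. The benchmark applies controlled perturbations along three dimensions (Query, Image, and Response) and, using the Bias-Deviation and Bias-Conformity metrics, tests 26 mainstream models. The dataset covers more than 1,800 samples from 29 source benchmarks and supports fine-grained diagnosis of nine bias types. Experiments reveal systematic modality neglect and asymmetric evaluation tendencies, indicating that current MLLM judges are not sufficiently reliable when evidence is missing or perturbed.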
Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
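The abstract names the two metrics but does not give their formulas, so the sketch below is illustrative only and is not the paper's definition of BD or BC. It assumes a judge that returns a numeric score for each sample before and after a perturbation. The sensitivity-style rate is the share of samples whose score moves when the perturbation should matter (for example, the image is removed), and the stability-style rate is the share whose score stays put when the perturbation is semantically irrelevant. The names `JudgedPair`, `deviation_rate`, `conformity_rate`, `min_change`, and `tolerance` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgedPair:
    """Scores from one judge on one sample (hypothetical structure)."""
    original_score: float   # score on the unperturbed (query, image, response)
    perturbed_score: float  # score after a controlled perturbation


def deviation_rate(pairs: Sequence[JudgedPair], min_change: float = 1.0) -> float:
    """Share of pairs whose score moved by at least `min_change`.

    Assumed proxy for sensitivity: higher means the judge reacts when
    evidence it should depend on is removed or mismatched.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    changed = sum(
        1 for p in pairs
        if abs(p.original_score - p.perturbed_score) >= min_change
    )
    return changed / len(pairs)


def conformity_rate(pairs: Sequence[JudgedPair], tolerance: float = 0.0) -> float:
    """Share of pairs whose score stayed within `tolerance`.

    Assumed proxy for stability: higher means the judge ignores
    semantically irrelevant perturbations.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    stable = sum(
        1 for p in pairs
        if abs(p.original_score - p.perturbed_score) <= tolerance
    )
    return stable / len(pairs)


# Made-up scores, purely to show the call pattern.
evidence_removed = [JudgedPair(8, 3), JudgedPair(7, 7), JudgedPair(9, 5), JudgedPair(6, 6)]
irrelevant_change = [JudgedPair(8, 8), JudgedPair(7, 6), JudgedPair(5, 5)]

sensitivity = deviation_rate(evidence_removed)
stability = conformity_rate(irrelevant_change)
```

Under these assumptions, a judge that keeps its score after the image is removed lowers the first rate, which is one way the modality neglect described above would show up in numbers.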