Perception or Prejudice: Can Multimodal Large Language Models Go Beyond First Impressions of Personality?
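Multimodal large language models (MLLMs) are widely used in human-computer interaction that requires personality perception, yet existing evaluations focus only on predicting Big Five personality scores. This study proposes a new Grounded Personality Reasoning task and releases the MM-OCEAN dataset of 1,104 videos. Testing 27 models with a three-tier evaluation framework, it finds a key "Prejudice Gap": across all models, 51% of correct ratings are not grounded in retrieved behavioral cues, and the Holistic-grounding Rate ranges only from 0% to 33.5%. This indicates that models often merely "guess" the right score rather than reaching it through the right reasoning, and it points the way toward improving grounded social cognition in MLLMs.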
Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification and containing timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) and four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR). We then benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
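To make the Prejudice Gap figure concrete, the following is a minimal sketch of how a statistic of the form "share of correct ratings that are not grounded in retrieved cues" could be computed from per-sample outcomes. It is not the paper's implementation: the `SampleResult` type, its field names, and the function `ungrounded_share_of_correct` are hypothetical, and the abstract does not specify how a rating is judged correct or how a cue is judged retrieved, so both are assumed here to arrive as precomputed booleans.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SampleResult:
    """Hypothetical per-sample outcome; field names are illustrative only."""
    rating_correct: bool  # assumed: the predicted trait rating was judged correct
    cue_grounded: bool    # assumed: the supporting behavioral cues were retrieved


def ungrounded_share_of_correct(results: Iterable[SampleResult]) -> Optional[float]:
    """Fraction of correctly rated samples that lack cue grounding.

    Returns None when no sample was rated correctly, because the ratio
    is undefined in that case.
    """
    correct = [r for r in results if r.rating_correct]
    if not correct:
        return None
    ungrounded = sum(1 for r in correct if not r.cue_grounded)
    return ungrounded / len(correct)


if __name__ == "__main__":
    # Toy data, invented for illustration; not drawn from MM-OCEAN.
    toy = [
        SampleResult(rating_correct=True, cue_grounded=True),
        SampleResult(rating_correct=True, cue_grounded=False),
        SampleResult(rating_correct=False, cue_grounded=False),
    ]
    print(ungrounded_share_of_correct(toy))
```

Conditioning on correct ratings, rather than dividing by all samples, matches the abstract's phrasing ("51% of correct ratings"); whether the paper's Prejudice Rate (PR) uses the same denominator is not stated in the abstract.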