PerceptionRubrics: Calibrating Multimodal Evaluation to Align with Human Perception

2026-06-26 08:00

AI Summary

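PerceptionRubrics proposes a rubric-based multimodal evaluation framework that shifts evaluation from holistic semantic matching to atomic auditing. It pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. The rubrics are derived from golden captions built through a Circular Peer-Review consensus pipeline and distilled into a dual-stream system of Must-Right and Easy-Wrong rubrics. The framework uses a Gated Scoring mechanism: failing a mandatory visual fact triggers a binary penalty. The evaluation yields three findings: (1) Reliability Gap: models can verify fragmented elements correctly but prove brittle under strict conjunctive constraints; (2) Open-Closed Stratification: a persistent 8% perception deficit separates open-source from proprietary frontier models; (3) Human-Aligned Rigor: the gated metrics align with human judgment substantially better than conventional benchmarks.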

Original text (untranslated)

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
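The contrast between a linear average and gated scoring can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the gate below is an assumption: any failed Must-Right rubric zeroes the score for that image, and otherwise the score is the pass rate over the Easy-Wrong rubrics. The names `RubricResult`, `linear_score`, and `gated_score` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one rubric against a model output (hypothetical type)."""
    kind: str      # "must_right" or "easy_wrong"
    passed: bool


def linear_score(results: List[RubricResult]) -> float:
    """Plain average: every rubric counts equally, regardless of its kind."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: a single failed Must-Right rubric zeroes the score.

    When every Must-Right rubric passes, the score is the pass rate over
    the Easy-Wrong rubrics (1.0 if there are none).
    """
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if any(not r.passed for r in must):
        return 0.0
    if not easy:
        return 1.0
    return sum(r.passed for r in easy) / len(easy)


# Example: 9 of 10 rubrics pass, but the single miss is a Must-Right fact.
results = (
    [RubricResult("must_right", True)] * 3
    + [RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True)] * 6
)
print(linear_score(results))  # 0.9
print(gated_score(results))   # 0.0
```

Under this assumed gate, a model that gets most fragments right still scores zero on the image once one essential fact is wrong, which is the kind of strict conjunctive behavior the Reliability Gap finding describes.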

HuggingFace Daily Papers (trending community papers)

Read the original: arxiv.org
