# PerceptionRubrics: Calibrating Multimodal Evaluation to Align with Human Perception

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-26 08:00
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmr36nkpd012ksly03czuci31
- Original link: https://arxiv.org/abs/2606.28322

## AI Summary

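PerceptionRubrics proposes a rubric-based multimodal evaluation framework that shifts evaluation from holistic semantic matching to atomic auditing. It pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. The rubrics are derived from golden captions built through a Circular Peer-Review consensus pipeline and distilled into a dual-stream system of Must-Right and Easy-Wrong rubrics. The framework uses a Gated Scoring mechanism: failure on a mandatory visual fact triggers a binary penalty. The evaluation yields three findings: (1) Reliability Gap: models verify fragmented elements correctly but prove brittle under strict conjunctive constraints; (2) Open-Closed Stratification: an 8% perception gap persists between open-source and proprietary frontier models; (3) Human-Aligned Rigor: the gated metrics align with human judgment substantially better than conventional benchmarks.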

## Main Text

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
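The abstract contrasts Gated Scoring with linear averaging but does not give the formula. As a minimal sketch only, the Python below assumes one plausible form of the gate: if any Must-Right rubric fails, the image scores zero; otherwise the score is the plain pass rate over all rubrics. The names `RubricResult`, `linear_score`, and `gated_score`, the stream labels, and the gating rule itself are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one rubric against a model's output (hypothetical type)."""
    stream: str   # "must_right" or "easy_wrong" (labels assumed for this sketch)
    passed: bool


def linear_score(results: List[RubricResult]) -> float:
    """Plain average: fraction of all rubrics passed."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: any failed must_right rubric zeroes the score.

    When every must_right rubric passes, fall back to the linear average.
    """
    gate_failed = any(r.stream == "must_right" and not r.passed for r in results)
    if gate_failed:
        return 0.0
    return linear_score(results)


# One essential fact is wrong, every fine-grained detail is right.
one_essential_miss = (
    [RubricResult("must_right", True)] * 3
    + [RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True)] * 4
)

# All essential facts are right, one fine-grained detail is wrong.
one_detail_miss = (
    [RubricResult("must_right", True)] * 4
    + [RubricResult("easy_wrong", True)] * 3
    + [RubricResult("easy_wrong", False)]
)

print(linear_score(one_essential_miss), gated_score(one_essential_miss))
print(linear_score(one_detail_miss), gated_score(one_detail_miss))
```

Both toy cases pass seven of eight rubrics, so a linear average cannot tell them apart. Under the assumed gate, only the case that misses an essential fact is penalized, which is the kind of conjunctive strictness the Reliability Gap finding describes.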
