PaintBench: Deterministic Evaluation of Precise Visual Editing

2026-05-29 08:00 · 35 days ago

AI Summary

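PaintBench is a dynamically scalable benchmark covering 20 precise visual editing operations in four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity yields an effectively infinite, contamination-resistant evaluation suite, scored with deterministic pixel-level evaluation (mIoU). Across 11 image editing models, the highest-performing industry-leading model reaches only 17.1% mIoU. Task decomposition shows that geometric transformation, most structural manipulation, and formula-based color change are especially difficult, and that models have their own specific specializations. Scene variations (object count, background complexity, color scheme, edit-region size) degrade performance. Validation against a second deterministic benchmark, TinyGrafixBench, shows a strong linear correlation between PaintBench scores and applied task performance (R² = 0.91, p < 0.001).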
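To make the idea of deterministic pixel-level scoring concrete, here is a minimal sketch of a mean intersection-over-union (mIoU) computation. It rests on assumptions not stated in the source: images are represented as 2D grids of integer palette indices, and IoU is averaged over every label that appears in either the prediction or the target. The name `mean_iou` and the `Grid` type are hypothetical, and the paper's exact mIoU definition may differ.

```python
from typing import Sequence

# Hypothetical representation: an image is a 2D grid of integer palette indices.
Grid = Sequence[Sequence[int]]


def mean_iou(pred: Grid, target: Grid) -> float:
    """Mean IoU over all labels present in either grid (illustrative definition)."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("prediction and target must have the same shape")

    inter = {}  # label -> pixels where both grids carry that label
    union = {}  # label -> pixels where at least one grid carries that label
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                inter[p] = inter.get(p, 0) + 1
                union[p] = union.get(p, 0) + 1
            else:
                # A mismatched pixel belongs to the union of both labels.
                union[p] = union.get(p, 0) + 1
                union[t] = union.get(t, 0) + 1

    if not union:
        raise ValueError("images must contain at least one pixel")
    return sum(inter.get(label, 0) / union[label] for label in union) / len(union)


# Toy 2x2 example: one pixel of label 1 was wrongly painted as label 0.
target = [[0, 0], [1, 1]]
pred = [[0, 0], [1, 0]]
score = mean_iou(pred, target)  # label 0: 2/3, label 1: 1/2
```

Because the score depends only on the two pixel grids, the same pair of images always yields the same number, which is the property that removes the need for a judge model.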

Original abstract · untranslated

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores (R^2 = 0.91, p < 0.001). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
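The reported R^2 measures how well a straight line relates one benchmark's scores to the other's. The sketch below shows how such a coefficient of determination can be computed from an ordinary least-squares fit. It is illustrative only: the function name `r_squared` is hypothetical, the numbers are toy values rather than results from the paper, and the p-value computation is omitted.

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares line fitted to (x, y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")

    # Ordinary least-squares line: y ~ slope * x + intercept.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy


# Toy values (not from the paper): per-model scores on two benchmarks.
benchmark_a = [1.0, 2.0, 3.0]
benchmark_b = [1.0, 3.0, 2.0]
fit_quality = r_squared(benchmark_a, benchmark_b)
```

A value near 1 means one benchmark's scores are almost fully explained by a linear function of the other's, which is the sense in which PaintBench scores are said to generalize to TinyGrafixBench.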

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
