ABACUS：适配统一基础模型以桥接图像计数理解与生成

2026-06-22 08:00·11天前

AI 摘要

ABACUS是一个统一的视觉语言模型，无需基准特定训练即可处理对象计数、人群计数、指代表达式计数和计数忠实的图像生成。它基于3B参数基础模型，通过三项创新适配目标定位：基于目标图的密度感知自适应缩放实现空间定位；GRPO边界感知计数策略消除裁剪边界错误；循环一致GRPO策略让理解分支自我批判生成输出，无需外部标注缩小理解-生成差距。在七个基准上取得SOTA，超越任务专用专家和更大通用模型。

原文 · 未翻译

ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training required. Our model is built on existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.

HuggingFace Daily Papers（社区热门论文）

43导出 Markdown

ABACUS：适配统一基础模型以桥接图像计数理解与生成

2026-06-22 08:00·11天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译