ABACUS: Adapting Unified Foundation Models to Bridge Image Counting Understanding and Generation
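Source: arxiv.org
Summary: ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without benchmark-specific training. It builds on a 3B-parameter foundation model and adapts it to object localization through three innovations: density-aware adaptive zooming based on objectness maps, which provides spatial grounding; a boundary-aware counting policy trained with GRPO, which eliminates crop-boundary errors; and a cycle-consistent GRPO strategy in which the understanding branch critiques the model's own generated outputs, narrowing the understanding-generation gap without external annotations. It achieves state-of-the-art results on seven benchmarks, surpassing task-specific specialists and larger generalist models.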
ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training. The model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks through three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy, trained via GRPO, that eliminates crop-boundary errors; and a cycle-consistent GRPO strategy in which the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.
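To make the cycle-consistent idea more concrete, the following is a minimal sketch of how a count-based reward could feed the group-relative advantage that GRPO-style training uses. It is not the paper's implementation: the abstract does not specify the reward, so `count_reward` and its decay shape are hypothetical, and the counts passed in stand in for what the understanding branch might report on a group of generated images. The use of the population standard deviation is also an assumption, since implementations differ on this point.

```python
from statistics import mean, pstdev


def count_reward(predicted: int, target: int) -> float:
    """Hypothetical reward: 1.0 for an exact count, decaying with relative error."""
    if predicted < 0 or target < 0:
        raise ValueError("counts must be non-negative")
    relative_error = abs(predicted - target) / max(target, 1)
    return 1.0 / (1.0 + relative_error)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def cycle_advantages(counted: list[int], target: int) -> list[float]:
    """Score a group of generated images by the counts read back from them.

    `counted` holds the counts an understanding model (hypothetical here)
    reports for each image generated from a prompt asking for `target` objects.
    """
    rewards = [count_reward(c, target) for c in counted]
    return group_relative_advantages(rewards)


if __name__ == "__main__":
    # Four samples for a prompt requesting 5 objects; only the first is exact.
    for count, adv in zip([5, 4, 7, 5], cycle_advantages([5, 4, 7, 5], target=5)):
        print(f"counted={count} advantage={adv:+.3f}")
```

Under these assumptions, images whose read-back count matches the prompt receive positive advantages and the rest receive negative ones, so no external annotation is needed to rank samples within a group. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.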