# UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-15 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo0vpnos02w1sli2egdexd15
- Original link: https://arxiv.org/abs/2604.14113

## AI Summary

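UI-Zoomer is a training-free adaptive zoom-in framework that improves GUI grounding through uncertainty quantification. It uses a confidence-aware gating mechanism to trigger zoom-in only when localization is uncertain, and it dynamically computes a per-instance crop radius from a variance decomposition, replacing the conventional uniform, fixed-size crop. On the ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 benchmarks, the method achieves accuracy gains of up to 13.4%, 10.3%, and 4.2%, respectively, with marked improvements in localizing small icons and dense layouts.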

## Main Text

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2%, respectively, with no additional training required.
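The abstract gives no formulas, so the following is only a minimal sketch of how the two components might fit together, not the authors' implementation. It assumes each stochastic candidate is a pixel box with per-token log-probabilities. The consensus measure, the fusion weight `alpha`, the threshold `tau`, the multiplier `k`, and `min_radius` are all hypothetical choices made for illustration. For crop sizing, the sketch treats a predicted point as uniformly distributed inside its candidate box, so the law of total variance gives, per axis, the variance of the box centers (inter-sample spread) plus the mean of `width² / 12` (intra-sample extent).

```python
from dataclasses import dataclass
from math import exp, sqrt
from statistics import fmean, pvariance
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Candidate:
    """One stochastic prediction: a box plus the log-probs of its generated tokens."""
    box: Box
    token_logprobs: List[float]  # assumed non-empty


def spatial_consensus(cands: List[Candidate]) -> float:
    """Hypothetical consensus: share of boxes that contain the mean predicted center."""
    cx = fmean([(c.box[0] + c.box[2]) / 2 for c in cands])
    cy = fmean([(c.box[1] + c.box[3]) / 2 for c in cands])
    hits = sum(
        c.box[0] <= cx <= c.box[2] and c.box[1] <= cy <= c.box[3] for c in cands
    )
    return hits / len(cands)


def token_confidence(cands: List[Candidate]) -> float:
    """Mean over candidates of the geometric-mean token probability."""
    return fmean([exp(fmean(c.token_logprobs)) for c in cands])


def should_zoom(cands: List[Candidate], alpha: float = 0.5, tau: float = 0.7) -> bool:
    """Gate (assumed form): zoom in only when the fused confidence falls below tau."""
    score = alpha * spatial_consensus(cands) + (1 - alpha) * token_confidence(cands)
    return score < tau


def crop_radius(
    cands: List[Candidate], k: float = 2.0, min_radius: float = 16.0
) -> Tuple[float, float]:
    """Per-axis crop half-size from Var(P) = Var(E[P | sample]) + E[Var(P | sample)]."""
    radii = []
    for lo, hi in ((0, 2), (1, 3)):  # x axis, then y axis
        centers = [(c.box[lo] + c.box[hi]) / 2 for c in cands]
        inter = pvariance(centers)  # inter-sample positional spread
        # Intra-sample extent: variance of a uniform distribution over the box side.
        intra = fmean([(c.box[hi] - c.box[lo]) ** 2 / 12 for c in cands])
        radii.append(max(min_radius, k * sqrt(inter + intra)))
    return radii[0], radii[1]
```

Under these assumptions, candidates that agree on one box with high token probability skip the zoom step entirely, while scattered candidates trigger it and receive a crop that widens with their spread.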
