# VISTA: GUI Grounding via View-Consistent Self-Verified Training

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-12 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqeknks1034wslun14rg0ixa
- Original link: https://arxiv.org/abs/2606.14579

## AI Summary

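When GRPO is applied directly to GUI grounding, single-view sampling makes hard instances fail across the whole group and easy instances succeed across the whole group, so no useful relative advantage emerges. VISTA is a GRPO-based training framework that builds each comparison group from multiple target-preserving views of the same GUI instance: each view is a crop that keeps the target element visible and remaps its bounding box exactly. VISTA also introduces a self-verified cross-view anchor, an oracle answer that is optimized with an advantage-weighted loss and excluded from the group baseline. Across five GUI grounding benchmarks and multiple Qwen backbones, VISTA consistently improves accuracy: on ScreenSpot-Pro, Qwen3-VL 4B/8B/30B-A3B rises from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses show higher worst-view accuracy and lower prediction flip rates.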
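
To make the "no useful relative advantage" problem concrete, here is a minimal sketch of group-relative advantage computation. It assumes the commonly used GRPO form, where each reward is standardized by the group's mean and standard deviation; the function name `group_relative_advantages`, the `eps` term, and the binary hit/miss rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group; 0 when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards: 1.0 if the predicted point hits the target box.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # hard instance
all_hit = group_relative_advantages([1.0, 1.0, 1.0, 1.0])   # easy instance
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])     # informative group
```

In the two uniform groups every advantage is zero, so the policy gradient carries no signal; only the mixed group separates good rollouts from bad ones. Building the group from several views of the same instance is meant to make mixed groups more likely.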

## Body

When Group Relative Policy Optimization (GRPO) is applied to GUI grounding, rollouts are sampled from a single screenshot view, so groups often collapse to all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline, and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
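
The target-preserving views described above could be produced roughly as follows. This is only a sketch under stated assumptions: crops are axis-aligned windows with integer pixel coordinates, no resizing is applied (so the box is remapped by pure translation), and the name `target_preserving_crop` and the `min_scale` parameter are hypothetical rather than the paper's actual procedure.

```python
import random


def target_preserving_crop(img_w, img_h, box, rng, min_scale=0.5):
    """Sample a crop that fully contains `box`; return (crop, remapped_box).

    `box` and `crop` are (x0, y0, x1, y1) in pixels of the original screenshot;
    `remapped_box` is the same box expressed in the crop's coordinate frame.
    """
    bx0, by0, bx1, by1 = box
    bw, bh = bx1 - bx0, by1 - by0
    # Crop size: never smaller than the target box, never larger than the image.
    cw = rng.randint(max(bw, int(img_w * min_scale)), img_w)
    ch = rng.randint(max(bh, int(img_h * min_scale)), img_h)
    # Crop origin: the window must cover the box and stay inside the image.
    cx0 = rng.randint(max(0, bx1 - cw), min(bx0, img_w - cw))
    cy0 = rng.randint(max(0, by1 - ch), min(by0, img_h - ch))
    crop = (cx0, cy0, cx0 + cw, cy0 + ch)
    # Without resizing, exact remapping is a translation by the crop origin.
    remapped = (bx0 - cx0, by0 - cy0, bx1 - cx0, by1 - cy0)
    return crop, remapped


# Four views of one hypothetical 1920x1080 screenshot with a 60x30 target.
rng = random.Random(0)
target = (800, 500, 860, 530)
views = [target_preserving_crop(1920, 1080, target, rng) for _ in range(4)]
```

Each entry in `views` pairs a crop window with the target box in that crop's frame, so rollouts sampled on different crops can be scored against the same underlying element. If a view were also rescaled, the remapped coordinates would need the same scale factor applied.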