# 信任正确的教师：面向GUI Grounding的质量感知自蒸馏

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-16 08:00
- AIHOT 分数：44
- AIHOT 链接：https://aihot.virxact.com/items/cmqiygkr105h8sl5wo464kymi
- 原文链接：https://arxiv.org/abs/2606.18101

## AI 摘要

GUI grounding要求视觉语言模型在高分辨率截图中识别小目标并预测精确坐标。OPSD（在策略自蒸馏）虽能提供密集token级教师信号，但朴素OPSD中学生生成前缀偏离目标时坐标token信号质量下降。本文提出质量感知自蒸馏，通过软正确性感知门控和教师概率缩放改善信号质量：门控检查教师当前坐标预测能否在给定前缀下完成到真实框，否则降权；教师概率缩放用置信度校准监督强度。两个组件单独无效，组合持续有效。在六个GUI grounding基准上一致提升基础模型并超越强基线。

## 正文

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signal. To mitigate this, We propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.