# SimuWoB: Simulating Real Mobile Apps for Fast and Reliable GUI Agent Evaluation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-24 08:00
- AIHOT score: 58
- AIHOT link: https://aihot.virxact.com/items/cmpmd6ghh0noisl01ap3q2dmj
- Original link: https://arxiv.org/abs/2605.25160

## AI Summary

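SimuWoB is a fully synthetic benchmark for mobile GUI agents, comprising 120 tasks that span diverse types and difficulty levels. A generation framework synthesizes high-fidelity tasks and virtual environments and automatically provides a valid reward for each task; each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. In experiments on state-of-the-art mobile GUI agents, the average success rate is only 27.92% and falls to 17.82% on long-horizon tasks, exposing the shortcomings of current agents in complex scenarios. A comparison with evaluation results on sampled real-world tasks indicates that evaluation in this synthetic environment generalizes well.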

## Main Text

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks, because constructing rewards on real applications is difficult; this leaves a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison with evaluation results on sampled real-world tasks demonstrates that agent assessments based on our synthetic environment generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.
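
As a rough illustration of how headline numbers such as an overall and a long-horizon success rate might be derived from per-task rewards, the sketch below aggregates binary rewards into percentages. It is not the authors' evaluation code: the `EpisodeResult` type, the `long_horizon` flag, the task IDs, and the toy outcomes are all hypothetical, and the abstract does not specify how results are averaged across agents or runs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EpisodeResult:
    """One agent attempt at one task (hypothetical record layout)."""
    task_id: str
    long_horizon: bool  # assumed flag marking long-horizon tasks
    reward: int         # assumed binary reward: 1 = success, 0 = failure


def success_rate(results: Iterable[EpisodeResult]) -> float:
    """Return the percentage of episodes with reward 1 (NaN if empty)."""
    items = list(results)
    if not items:
        return float("nan")
    return 100.0 * sum(r.reward for r in items) / len(items)


# Toy outcomes for illustration only; these are not results from the paper.
results = [
    EpisodeResult("task-a", long_horizon=False, reward=1),
    EpisodeResult("task-b", long_horizon=False, reward=0),
    EpisodeResult("task-c", long_horizon=True, reward=0),
    EpisodeResult("task-d", long_horizon=True, reward=0),
]

overall = success_rate(results)
long_only = success_rate(r for r in results if r.long_horizon)

print(f"overall success rate: {overall:.2f}%")
print(f"long-horizon success rate: {long_only:.2f}%")
```

Reporting the long-horizon subset separately, as the abstract does, makes it visible when an aggregate figure is carried mostly by shorter tasks.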
