# PhoneHarness: A Phone-Agent Benchmark and Execution Harness Mixing GUI, CLI, and Tool Actions

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-12 08:00
- AIHOT score: 49
- AIHOT link: https://aihot.virxact.com/items/cmqg7aocc002ksl9s6d4bng2z
- Original link: https://arxiv.org/abs/2606.14832

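## AI Summary

PhoneHarness is a mixed-action benchmark and execution harness for phone agents. It supports mixed routing across GUI, CLI, and host-side tool actions, and it produces auditable execution traces. Its evaluation set, PhoneHarness Bench, requires agents to complete mobile workflows with observable side effects rather than merely output plausible answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, 12.9 percentage points above the strongest non-PhoneHarness setting. The results indicate that reliable phone automation depends on action-surface routing and verifiable execution, not on visual GUI control alone.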

## Main Text

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.
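To make the ideas in the abstract concrete, the following is a minimal Python sketch of deterministic routing across GUI, CLI, and tool surfaces, with a bounded GUI budget, an execution trace, and a pass check based on observable side effects. It is not the authors' implementation: the abstract does not describe the routing rules or the trace format, so every name here (`Action`, `TraceEntry`, `Harness`, `max_gui_steps`, the toy `device_state` dictionary, and the handlers) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # hypothetical surface label: "gui", "cli", or "tool"
    payload: str


@dataclass
class TraceEntry:
    step: int
    surface: str
    payload: str
    result: str


@dataclass
class Harness:
    """Toy harness (assumed design): routes by surface and logs every step."""
    handlers: Dict[str, Callable[[str], str]]
    max_gui_steps: int = 3  # assumed bound on delegated GUI actions
    trace: List[TraceEntry] = field(default_factory=list)
    gui_steps: int = 0

    def run(self, action: Action) -> str:
        if action.surface not in self.handlers:
            result = "rejected: unknown surface"
        elif action.surface == "gui" and self.gui_steps >= self.max_gui_steps:
            result = "rejected: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_steps += 1
            result = self.handlers[action.surface](action.payload)
        # Rejected actions are recorded too, so the trace stays auditable.
        self.trace.append(
            TraceEntry(len(self.trace) + 1, action.surface, action.payload, result)
        )
        return result


def demo():
    # Simulated device state; a real harness would inspect an actual device.
    device_state = {"wifi": "off", "notes": []}

    def cli(payload: str) -> str:
        key, value = payload.split("=", 1)
        device_state[key] = value
        return "ok"

    def gui(payload: str) -> str:
        device_state["notes"].append(payload)
        return "ok"

    def tool(payload: str) -> str:
        return f"looked up {payload}"

    harness = Harness(handlers={"cli": cli, "gui": gui, "tool": tool},
                      max_gui_steps=1)
    harness.run(Action("cli", "wifi=on"))
    harness.run(Action("gui", "buy milk"))
    harness.run(Action("gui", "second note"))  # exceeds the GUI budget

    # Pass/fail is judged from observable side effects, not a final answer.
    passed = device_state["wifi"] == "on" and device_state["notes"] == ["buy milk"]
    return harness, device_state, passed


if __name__ == "__main__":
    harness, state, passed = demo()
    for entry in harness.trace:
        print(entry)
    print("passed:", passed)
```

In this sketch the third action is refused because the GUI budget is spent, yet it still appears in the trace, and the task is scored by inspecting the simulated device state. How PhoneHarness actually selects a surface, bounds GUI delegation, and verifies side effects is not specified in the abstract.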