# A Study on the Reliability of Computer-Use Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-20 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo8uj1c7073jslmla7ys9aze
- Original link: https://arxiv.org/abs/2604.17849

## AI Summary

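Computer-use agents perform well on tasks such as web navigation and desktop automation, yet they face a challenge in execution reliability: even with the task and model unchanged, a single success does not guarantee stable results on repeated runs. The study runs the same tasks multiple times on OSWorld and, through paired statistical analysis, finds that reliability is affected by three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. The key determinants are how tasks are specified and how consistent agent behavior is across executions. The study recommends evaluating agents under repeated execution and favoring strategies that remain stable across runs.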

## Main Text

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.
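The evaluation idea described above (repeated executions of the same task, compared across settings with a paired test) can be illustrated with a minimal sketch. The abstract does not say which paired test the authors use, so the exact sign test below is a stand-in chosen for simplicity, and the task names, setting names, and run outcomes are all hypothetical.

```python
from math import comb


def success_rate(outcomes):
    """Fraction of repeated runs that succeeded (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def paired_sign_test(rates_a, rates_b):
    """Two-sided exact sign test on per-task paired differences.

    Tasks whose rate is identical in both settings are dropped as ties.
    This is an assumed stand-in, not necessarily the paper's test.
    """
    diffs = [b - a for a, b in zip(rates_a, rates_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    wins = sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes: 5 repeated runs per task under two settings.
baseline = {
    "task_a": [1, 0, 1, 0, 1],
    "task_b": [0, 0, 1, 0, 0],
    "task_c": [1, 1, 1, 1, 1],
}
clarified = {
    "task_a": [1, 1, 1, 1, 1],
    "task_b": [1, 0, 1, 1, 0],
    "task_c": [1, 1, 1, 1, 1],
}

tasks = sorted(baseline)
rates_baseline = [success_rate(baseline[t]) for t in tasks]
rates_clarified = [success_rate(clarified[t]) for t in tasks]
p_value = paired_sign_test(rates_baseline, rates_clarified)
print(f"per-task rates: {rates_baseline} -> {rates_clarified}, p = {p_value}")
```

Pairing by task matters here because task difficulty varies widely: comparing pooled success rates would mix between-task variance into the comparison, whereas per-task differences isolate the effect of the changed setting. With only a handful of tasks the test has little power, so a real analysis would need many more tasks than this toy example.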
