# Claw-Anything: A Benchmark for Evaluating Always-On Personal Assistants with Broad Access to the User's Digital World

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-05-25 08:00
- AIHOT score: 66
- AIHOT link: https://aihot.virxact.com/items/cmpm2fnvm0kz2sl01c6uxt6c3
- Original paper: https://arxiv.org/abs/2605.26086

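## AI Summary

Today's large language model agents, when deployed as always-on personal assistants, can access only a limited part of the user's digital world, which constrains their context-sensitive reasoning. The Claw-Anything benchmark addresses this by expanding agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across devices. It simulates months of user activity to generate environments with complex state and noise. In experiments, GPT-5.5 reaches a pass@1 of only 34.5% on the benchmark, well below its results on earlier benchmarks, indicating a substantial gap between current agent capabilities and what an always-on assistant requires. The research team also releases an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%.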

## Main Text

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating the utility of scalable data infrastructure.
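For readers unfamiliar with the metric, pass@1 is commonly read as the fraction of tasks an agent solves in a single attempt. The abstract does not say how many attempts per task were sampled or which estimator was used, so the following is only a minimal sketch of the widely used unbiased pass@k estimator, under the assumption that each task is attempted `n` times and `c` of those attempts pass. The function names and the task names in the usage note are hypothetical and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Uses the standard combinatorial estimator 1 - C(n-c, k) / C(n, k).
    For k = 1 this reduces to the plain success rate c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k sample contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average per-task pass@1 over a benchmark.

    `results` maps a (hypothetical) task id to the pass/fail outcome
    of each attempt on that task.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_at_k(len(a), sum(a), 1) for a in results.values()]
    return sum(scores) / len(scores)
```

As a usage illustration, `benchmark_pass_at_1({"calendar_conflict": [True, False], "refund_followup": [False, False]})` averages a per-task rate of 0.5 and 0.0. A reported figure such as 34.5% would correspond to this average taken over all benchmark tasks, if the benchmark follows this convention.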
