iOSWorld: A Benchmark for Personalized Smartphone Agents
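Source: arxiv.org
iOSWorld is the first native iOS simulator benchmark built around a persistent user identity. It comprises 26 newly built, interconnected apps and 133 tasks in three categories: single-app (27), multi-app (60, spanning 2 to 8 apps), and memory and personalization (46, requiring agents to infer patterns from personal data). Frontier and open-source models are evaluated in vision-only and privileged vision+XML settings. The best accuracy is 52% overall (only 37% on multi-app tasks); privileged XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit. The benchmark is released as open source.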
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
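As a quick sanity check on the task breakdown above, the three category counts can be summed and expressed as shares of the benchmark. This is only an illustrative sketch: the dictionary keys and the `category_shares` helper are hypothetical names chosen here, not identifiers from the iOSWorld release, and the only inputs are the counts stated in the abstract.

```python
# Task counts per category, as stated in the abstract.
# The key names are illustrative, not taken from the iOSWorld codebase.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


def category_shares(counts):
    """Return each category's share of all tasks as a fraction of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_tasks = sum(TASK_COUNTS.values())
shares = category_shares(TASK_COUNTS)

if __name__ == "__main__":
    print(f"total tasks: {total_tasks}")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The counts add up to the stated 133 tasks, and multi-app tasks form the largest category, which is worth keeping in mind when reading the gap between the overall and multi-app accuracy figures.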