TerminalWorld: Evaluating Agents on Real Terminal Tasks

2026-05-21 08:00

AI Summary

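The research team released TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from a large corpus of real terminal recordings. The engine processed 80,870 recordings and produced a benchmark of 1,530 tasks spanning 18 categories and 1,280 unique commands, including a manually reviewed subset of 200 tasks. Evaluation shows that current frontier models and agents perform poorly on real terminal workflows, with a maximum pass rate of only 62.5%. Scores on the benchmark correlate only weakly with those on existing expert-curated benchmarks (Pearson r=0.20), which indicates that it measures distinct capabilities. Because the engine is automated, the benchmark is both authentic and scalable. The data and code are open source.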

Original Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
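The abstract reports two summary statistics: a pass rate (62.5% at best) and a Pearson correlation between TerminalWorld scores and scores on an existing benchmark (r=0.20). As a minimal sketch of how such figures are typically computed, the Python below defines both from scratch. The function names `pass_rate` and `pearson_r` are hypothetical, and the sketch is not taken from the TerminalWorld codebase; the paper's actual evaluation scripts may differ.

```python
from math import sqrt


def pass_rate(results):
    """Return the fraction of tasks that passed.

    `results` is a list of booleans, one per task (True = passed).
    This input format is an assumption made for illustration.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between two score lists.

    `xs` and `ys` hold paired scores, e.g. one pair per model:
    (score on benchmark A, score on benchmark B).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

For example, an agent that passes 125 of the 200 Verified tasks has `pass_rate` equal to 0.625, which matches the reported maximum of 62.5%. A `pearson_r` near 0.20 across the evaluated systems would mean that ranking well on one benchmark says little about ranking well on the other.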

HuggingFace Daily Papers (community trending papers)

Original paper: arxiv.org