# CLI-Universe：面向终端智能体的可验证任务合成引擎

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-22 08:00
- AIHOT 分数：50
- AIHOT 链接：https://aihot.virxact.com/items/cmqq6uzx5075cslp5dammbrqo
- 原文链接：https://arxiv.org/abs/2606.22883

## AI 摘要

CLI-Universe是一个原则性合成引擎，通过多维能力分类树采样并基于真实技术材料进行证据引导深度研究，生成候选终端智能体任务。候选任务经Docker实例化后，通过rubric-gated测试构造、hint-conditional过滤和严格fail-to-pass检查等多阶段可执行验证流水线，约三分之二的候选被丢弃，仅保留真实、可验证且有难度的任务。基于此构建的6,000条轨迹数据集CLI-Universe-6K，微调Qwen3-32B后在Terminal-Bench 2.0上达到33.4%准确率，创下开源数据训练的32B及以下参数模型新SOTA，并超越多个参数规模大一个数量级的模型。

## 正文

While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.
