# LiteCoder-Terminal: Building Scalable Long-Horizon Terminal Environments for Training Language Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmpqb0qos03dpslno2lc15hsq
- Original paper: https://arxiv.org/abs/2605.29559

## AI Summary

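Training language agents that can plan over multiple steps and adapt dynamically in terminal environments is bottlenecked by a reliance on externally scraped repositories. The research team proposes LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable, verifiable terminal environments from domain specifications. Two large-scale resources are built with it: an SFT dataset of 11,255 expert trajectories across 10 domains, and an RL resource of 602 verifiable environments for trajectory-level preference optimization. Fine-tuning Qwen-family models on the SFT dataset substantially improves agent performance; the 32B variant reaches pass@1 scores of 29.06%, 18.54%, and 34.00% on Terminal Bench 1.0, 2.0, and Pro, respectively. Applying Direct Multi-turn Preference Optimization (DMPO) yields further gains.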
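For readers unfamiliar with the metric, pass@1 is the fraction of benchmark tasks solved on a single attempt. The sketch below uses the commonly used unbiased pass@k estimator, which reduces to `c / n` when `k = 1`. The function names and the toy results are illustrative assumptions; the source does not say how many attempts per task were sampled or how scores were aggregated.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts sampled, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_task: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k in percent over (n, c) pairs, one pair per task."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)


# Hypothetical results for four tasks, one attempt each: two solved, two failed.
toy_results = [(1, 1), (1, 0), (1, 0), (1, 1)]
score = benchmark_score(toy_results)
```

With one attempt per task, the score is simply the percentage of tasks solved, which is how a figure such as 29.06% on a benchmark would typically be read.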

## Main Text

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.
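To make "executable and verifiable" concrete, here is a minimal sketch of what a verifiable terminal task could look like: a setup command creates the initial state, the agent's commands are replayed, and a verifier command's exit code decides success. This is not the paper's pipeline or data format. The `TerminalTask` fields, the `evaluate` helper, and the toy task are all hypothetical, and a real system would run commands inside an isolated container rather than a temporary directory.

```python
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class TerminalTask:
    """Hypothetical task record; field names are illustrative, not from the paper."""
    instruction: str  # natural-language goal shown to the agent
    setup_cmd: str    # shell command that prepares the initial state
    verify_cmd: str   # shell command; exit code 0 means the task is solved


def run(cmd: str, cwd: str) -> int:
    """Run a shell command in cwd and return its exit code."""
    return subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True).returncode


def evaluate(task: TerminalTask, agent_cmds: list[str]) -> bool:
    """Replay the agent's commands in a fresh directory, then run the verifier."""
    with tempfile.TemporaryDirectory() as workdir:
        run(task.setup_cmd, workdir)
        for cmd in agent_cmds:
            run(cmd, workdir)
        return run(task.verify_cmd, workdir) == 0


# Toy task (an assumption for illustration): count the lines of a file.
task = TerminalTask(
    instruction="Count the lines in data.txt and write the number to count.txt",
    setup_cmd="printf 'a\\nb\\nc\\n' > data.txt",
    verify_cmd="test \"$(tr -d ' ' < count.txt)\" = 3",
)

solved = evaluate(task, ["wc -l < data.txt > count.txt"])
failed = evaluate(task, ["echo 5 > count.txt"])
```

Because the verifier checks the final state instead of comparing against a reference command sequence, any trajectory that reaches the correct state counts as a success, which is what makes such environments usable as a supervision signal for both trajectory filtering and preference optimization.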
