# ClawEnvKit: An Automatic Environment Generation Toolkit for Claw-Like Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-20 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo84qcqu03wdslmlf8oxun17
- Original link: https://arxiv.org/abs/2604.18543

## AI Summary

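ClawEnvKit is an automated environment generation pipeline for claw-like agents. Its parser, generator, and validator turn natural language descriptions into diverse, verified environments. Auto-ClawEval, the benchmark built with it, contains 1,040 environments across 24 categories and costs 13,800x less than manual construction at comparable quality. An evaluation across 4 model families and 8 agent harness frameworks shows that harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline. The toolkit also supports live evaluation and on-demand training environment generation, adapting the task distribution to an agent's weaknesses.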

## Main Text

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
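The three-module structure described above (parser, generator, validator) can be illustrated with a toy sketch. This is not ClawEnvKit's actual code or API: every name here (`GenParams`, `Environment`, `parse_request`, `generate`, `validate`, `build_environments`) is hypothetical, the parser is a simple regex stand-in, and the generator is a template stub where a real system would presumably call a language model. The validator covers only structural validity, internal consistency of scoring weights, and a crude duplicate check for diversity; feasibility checking is omitted because the abstract does not say how it is done.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GenParams:
    """Structured generation parameters (hypothetical fields)."""
    category: str
    num_envs: int


@dataclass
class Environment:
    """One generated environment: task spec, tool interface, scoring config."""
    task: str
    tools: tuple[str, ...]
    scoring: dict[str, float]


def parse_request(text: str) -> GenParams:
    """Toy parser: pull a count and a category out of free text."""
    count = re.search(r"(\d+)", text)
    category = re.search(r"for (\w+)", text)
    return GenParams(
        category=category.group(1) if category else "general",
        num_envs=int(count.group(1)) if count else 1,
    )


def generate(params: GenParams) -> list[Environment]:
    """Stub generator: fills a fixed template instead of calling a model."""
    return [
        Environment(
            task=f"{params.category} task variant {i}",
            tools=(f"{params.category}_tool",),
            scoring={"completion": 1.0},
        )
        for i in range(params.num_envs)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Keep environments that pass three simplified checks."""
    seen_tasks: set[str] = set()
    kept: list[Environment] = []
    for env in envs:
        structurally_valid = bool(env.task) and bool(env.tools)
        # Assumption: scoring weights are expected to sum to 1.
        consistent = abs(sum(env.scoring.values()) - 1.0) < 1e-9
        is_new = env.task not in seen_tasks  # crude diversity check
        if structurally_valid and consistent and is_new:
            seen_tasks.add(env.task)
            kept.append(env)
    return kept


def build_environments(text: str) -> list[Environment]:
    """Run the parse -> generate -> validate pipeline on one request."""
    return validate(generate(parse_request(text)))
```

Calling `build_environments("Make 3 environments for calendar")` runs all three stages in sequence. Keeping validation as a separate final stage means a generator can be swapped out without weakening the guarantees on what reaches the benchmark.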
