Rohan Paul @rohanpaul_ai · April 30
The paper proposes a way for a coding agent to rewrite its own tools and rules, then check whether each change really helped.
The big deal is that it turns harness tuning from guesswork into an auditable experiment, so the part of agent systems that quietly eats the most time and effort can now improve itself in a controlled and measurable way.
The problem is that agent harnesses, meaning the prompts, tools, memory, and rules around a model, are usually tuned by hand or changed through messy self-improvement loops that produce lots of edits but little clear evidence about what helped.
The method, called Agentic Harness Engineering, turns those edits into file-level parts that can be changed or rolled back, compresses huge run logs into short failure evidence, and makes the agent write a prediction for each edit that later gets checked against real task results.
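The predict-then-check idea can be sketched in a few lines. This is a toy illustration, not the paper's code: the `Edit` record, `apply_and_verify`, the `toy_evaluate` scorer, and the "keep only if more tasks pass" rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

Harness = Dict[str, str]                    # file name -> file content
Evaluator = Callable[[Harness], Set[str]]   # returns ids of tasks that passed


@dataclass
class Edit:
    path: str                    # the single harness file this edit touches
    new_content: str             # proposed replacement content for that file
    predicted_fixes: Set[str]    # tasks the agent predicts will start passing


def apply_and_verify(harness: Harness, edit: Edit,
                     evaluate: Evaluator) -> Tuple[Harness, dict]:
    """Apply one file-level edit, score it, and roll back if it did not help."""
    before = evaluate(harness)
    candidate = dict(harness)              # copy, so rollback is just "keep the old dict"
    candidate[edit.path] = edit.new_content
    after = evaluate(candidate)

    fixed = after - before                 # tasks that flipped to passing
    broken = before - after                # tasks that regressed
    record = {
        "path": edit.path,
        "predicted": sorted(edit.predicted_fixes),
        "confirmed": sorted(edit.predicted_fixes & fixed),
        "regressions": sorted(broken),
        # Assumed acceptance rule: keep the edit only if more tasks pass overall.
        "kept": len(after) > len(before),
    }
    return (candidate if record["kept"] else harness), record


def toy_evaluate(harness: Harness) -> Set[str]:
    """Hypothetical scorer: task t2 passes only if the rules mention running tests."""
    passed = {"t1"}
    if "run tests" in harness.get("rules.md", ""):
        passed.add("t2")
    return passed


seed = {"rules.md": ""}
good = Edit("rules.md", "always run tests before finishing", {"t2"})
evolved, good_record = apply_and_verify(seed, good, toy_evaluate)

bad = Edit("rules.md", "", {"t2"})         # wipes the rule, so t2 regresses
after_bad, bad_record = apply_and_verify(evolved, bad, toy_evaluate)
```

Here the first edit is kept because its prediction is confirmed, while the second is rolled back and its record lists the regression. The audit trail is the list of such records, one per edit.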
They tested this on Terminal-Bench 2, a hard coding benchmark in a terminal, by starting from a very small shell-only harness and letting the loop run for 10 rounds while keeping the base model fixed.
The single-try success rate rose from 69.7% to 77.0%, beating Codex-CLI at 71.9% and other self-evolving baselines, which suggests the gains came from better harness design rather than from swapping in a stronger model.
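For scale, the reported numbers work out as follows. This is quick arithmetic on the figures quoted above; the variable names are just for illustration.

```python
seed, evolved, codex_cli = 69.7, 77.0, 71.9   # single-try success rates, in percent

gain_points = round(evolved - seed, 1)                        # absolute gain over the seed harness
margin_over_codex = round(evolved - codex_cli, 1)             # lead over Codex-CLI
relative_gain_pct = round((evolved - seed) / seed * 100, 1)   # gain relative to the seed

print(gain_points, margin_over_codex, relative_gain_pct)      # 7.3 5.1 10.5
```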
The final harness also carried over to other models and to SWE-bench-verified, with gains of 5.1 to 10.1 points across model families and 12% fewer tokens than the seed on SWE-bench-verified. That matters because harness work is expensive, and this gives a more reliable way to let that layer improve itself without drifting into random noise.
----
Paper Link – arxiv.org/abs/2604.25850
Paper Title: "Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses"