# An Empirical Study of the Cost-Effectiveness of Code Execution in LLM Program Repair Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-25 08:00
- AIHOT score: 66
- AIHOT link: https://aihot.virxact.com/items/cmqzhlkh701qssltjvr8wu46b
- Original link: https://arxiv.org/abs/2606.26978

## AI Summary

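This study analyzes 7,745 agent traces from the SWE-bench leaderboard and evaluates 3,000 repair attempts on 200 instances by Claude Code, Codex, and the open-source OpenCode under four execution paradigms. The results show that code execution averages 8.8 test runs per task, with frequency ranging from 2 to 19 per task, and that late-stage executions have higher success rates. For commercial SOTA agents, the repair success gap between prohibited and unrestricted execution is only 1.25 percentage points (not statistically significant), while prohibiting execution yields substantial savings in tokens and wall-clock time. The benefit of execution is concentrated rather than uniformly distributed. The study suggests that current agents use code execution indiscriminately and that execution should be treated as a resource with an explicit cost-benefit tradeoff.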

## Main Text

LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, and their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study of execution behavior in LLM-based program repair. First, to characterize execution behavior at scale, we analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows for a fine-grained comparison of performance and cost. Our analysis reveals three key observations: (1) Code execution is used across all agents and models analyzed, with an average of 8.8 test runs per task. Execution behavior varies substantially across agents and models, with frequency ranging from 2 to 19 test runs per task, and late-stage executions consistently achieve higher success rates than early-stage ones. (2) Execution restrictions have little effect on repair success: on commercial agents with SOTA models, the resolve-rate gap between Prohibited and Unrestricted is only 1.25 percentage points and not statistically significant, while Prohibited saves substantial token and wall-clock cost. (3) Execution benefit is concentrated rather than uniform. These patterns suggest that current agents apply execution indiscriminately, paying its cost on instances where it provides little benefit. Execution, therefore, should be treated as a resource with an explicit cost-benefit tradeoff, not a default capability.
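To make observation (2) concrete, the following is a minimal sketch of how a resolve-rate gap between two execution paradigms could be computed and checked for significance when both are run on the same instances. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not say which statistical test was used, so the exact McNemar test here is a swapped-in choice that is common for paired pass/fail outcomes. The function names and the toy outcome lists are hypothetical and do not reproduce any data from the study.

```python
from math import comb


def resolve_rate_gap(outcomes_a, outcomes_b):
    """Gap in percentage points between the resolve rates of two paradigms.

    Both arguments are lists of booleans, one entry per instance,
    aligned so that index i refers to the same instance in each list.
    """
    if len(outcomes_a) != len(outcomes_b) or not outcomes_a:
        raise ValueError("outcome lists must be non-empty and of equal length")
    return 100.0 * (sum(outcomes_a) - sum(outcomes_b)) / len(outcomes_a)


def mcnemar_exact_p(outcomes_a, outcomes_b):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    Only discordant pairs (resolved under one paradigm but not the other)
    carry information. Under the null hypothesis, each discordant pair is
    equally likely to favor either paradigm.
    """
    only_a = sum(1 for a, b in zip(outcomes_a, outcomes_b) if a and not b)
    only_b = sum(1 for a, b in zip(outcomes_a, outcomes_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy outcomes for eight instances (NOT data from the paper).
unrestricted = [True, True, True, False, True, False, True, True]
prohibited = [True, True, False, False, True, True, True, False]

if __name__ == "__main__":
    gap = resolve_rate_gap(unrestricted, prohibited)
    p_value = mcnemar_exact_p(unrestricted, prohibited)
    print(f"gap = {gap:.2f} percentage points, p = {p_value:.3f}")
```

A paired test is used in this sketch because each instance is attempted under both paradigms, so a small gap on a few hundred instances can easily arise from a handful of discordant instances rather than from a systematic effect.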
