# Scaling Test-Time Compute for Agentic Coding

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-05-23 22:29
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpih2tm40x1fsljwmtsu6l60
- Original link: https://x.com/rohanpaul_ai/status/2058193482716758053

## AI Summary

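Meta research finds that, on coding-agent tasks, reusing short summaries of past attempts performs significantly better than using raw logs. The paper argues that for long-horizon coding tasks, the main bottleneck has shifted from code generation to how the agent's work is remembered and represented. Its method converts each error-filled "messy trajectory" into a compact summary of the core hypothesis, progress, and failure points; the system then uses tournament-style selection to pick the best summaries and guide a new round of attempts. In tests with Claude 4.5 Opus, the method raised the SWE-Bench Verified score from 70.9% to 77.6%, demonstrating that the key to better performance is storing experience in a reusable form.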

## Full Text

Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs.

That is, stronger coding agents need not just more attempts, but better ways to remember them.

That sounds obvious until you look at what an agent actually produces: not an answer, but a messy trail of file reads, shell commands, errors, partial fixes, and abandoned ideas.

The paper's idea is to turn each full attempt into a compact summary of the main guess, partial progress, and failure points, then use those summaries both to pick the best attempts and to guide new ones.
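
To make the "compact summary" idea concrete, here is a minimal sketch of what such a record could look like. The class name, field names, and text layout are all hypothetical; the post does not describe the paper's actual summary format, only that it covers the main guess, partial progress, and failure points.

```python
from dataclasses import dataclass, field


@dataclass
class AttemptSummary:
    """Compact record of one full attempt (all field names are hypothetical)."""

    main_guess: str
    partial_progress: list[str] = field(default_factory=list)
    failure_points: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the summary as short plain text that a later prompt can include."""
        lines = [f"Main guess: {self.main_guess}"]
        lines += [f"Progress: {item}" for item in self.partial_progress]
        lines += [f"Failed: {item}" for item in self.failure_points]
        return "\n".join(lines)


# Illustrative values only; not taken from the paper.
summary = AttemptSummary(
    main_guess="Off-by-one in pagination offset",
    partial_progress=["Reproduced the failing test"],
    failure_points=["Patch broke the empty-page case"],
)
print(summary.render())
```

The point of a fixed, small shape is that two attempts become directly comparable, which a raw trail of shell commands and errors is not.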

Test-time scaling breaks when the model cannot compare its own past work.

For short answers, ranking is easy.

For long-horizon coding, the bottleneck shifts from generation to representation.

Once rollouts become summaries， two useful things happen.

The system can run tournament-style selection over small groups of candidates, which works better than forcing one giant comparison, and it can feed the best summaries back into a fresh round of attempts instead of starting blind.
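
Tournament-style selection over small groups can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: `tournament_select`, `pick_best`, and `group_size` are hypothetical names, and in a real system `pick_best` would presumably be a model-based comparison rather than the toy judge used here.

```python
from typing import Callable, Sequence


def tournament_select(
    candidates: Sequence[str],
    pick_best: Callable[[Sequence[str]], int],
    group_size: int = 4,
) -> str:
    """Reduce candidates to one winner through repeated small-group comparisons.

    pick_best receives one small group and returns the index of its winner.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        # Compare only a few candidates at a time instead of the whole pool.
        for start in range(0, len(pool), group_size):
            group = pool[start:start + group_size]
            winners.append(group[pick_best(group)])
        pool = winners
    return pool[0]


def longest(group: Sequence[str]) -> int:
    """Toy stand-in judge: prefer the longest summary in the group."""
    return max(range(len(group)), key=lambda i: len(group[i]))


best = tournament_select(["a", "bbb", "cc", "dddd", "e"], longest, group_size=2)
print(best)
```

Each call to the judge sees at most `group_size` summaries, so the comparison stays small no matter how many attempts were run.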

---

The authors test this on two hard coding benchmarks by running many attempts in parallel, selecting promising summaries with a tournament-style voting method, and then launching fresh attempts that can read the selected summaries first.
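
The three steps described here (parallel attempts, summary selection, fresh attempts that read the selected summaries) can be sketched as one loop. Everything named in this block is an assumption made for illustration: `refine_with_summaries`, `run_attempt`, `summarize`, `select`, `n_attempts`, and `k` do not come from the paper, and the stubs at the bottom stand in for real agent and model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def refine_with_summaries(
    task: str,
    run_attempt: Callable[[str, Sequence[str]], str],
    summarize: Callable[[str], str],
    select: Callable[[Sequence[str], int], List[str]],
    n_attempts: int = 8,
    k: int = 2,
) -> List[str]:
    """Run one blind round, keep k summaries, then run one informed round."""
    with ThreadPoolExecutor() as pool:
        # Round 1: independent attempts with no prior context.
        first_round = list(pool.map(lambda _: run_attempt(task, []), range(n_attempts)))

        # Compress each raw trajectory into a short summary, then keep the top k.
        summaries = [summarize(trajectory) for trajectory in first_round]
        chosen = select(summaries, k)

        # Round 2: fresh attempts that read the selected summaries first.
        second_round = list(pool.map(lambda _: run_attempt(task, chosen), range(n_attempts)))
    return second_round


# Stub components so the sketch runs end to end (hypothetical behavior).
def fake_attempt(task: str, hints: Sequence[str]) -> str:
    return f"{task} with {len(hints)} hints"


results = refine_with_summaries(
    "fix bug",
    run_attempt=fake_attempt,
    summarize=str.upper,
    select=lambda summaries, k: list(summaries)[:k],
    n_attempts=3,
    k=2,
)
print(results)
```

How a final answer is chosen from the second round is not specified in the post, so the sketch simply returns all second-round trajectories.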

The results are strong: Claude 4.5 Opus rises from 70.9% to 77.6% on SWE-Bench Verified (+6.7 points) and from 46.9% to 59.1% on Terminal-Bench v2.0 (+12.2 points).

What matters is that the paper says better test-time scaling for long coding agents is not mostly about making more attempts, but about storing experience in a form the agent can actually reuse.

---

Paper Link - arxiv.org/abs/2604.16529

Paper Title: "Scaling Test-Time Compute for Agentic Coding"