Rohan Paul @rohanpaul_ai · May 29
Stronger agents will not come only from larger models, but from better systems around them.
The problem is that many AI agents are judged as if the model alone did the work, even though the real behavior also depends on memory, tools, context, routing, checks, and permissions.
This setup around the agent is called the harness: the system that decides what the model sees, what tools it can use, what it remembers, and which actions get checked.
Progress should come from scaling this harness, especially 3 parts: better context control, more trustworthy memory, and better routing to tools or helper agents.
Long context is not the same as usable context, memory is not the same as trustworthy memory, and having many tools is not the same as knowing when to use them.
A stale note can be more dangerous than no note, because it gives the agent confidence exactly when it should re-check the world.
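One way to picture the stale-note problem is a memory read that reports a note's age along with its text, so the agent knows when to re-check instead of trusting it. This is only a minimal sketch: the `Note` class, the `recall` function, and the one-hour `max_age_s` threshold are hypothetical names and values chosen for illustration, not anything taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """A remembered fact plus the time it was written (hypothetical structure)."""
    text: str
    written_at: float  # seconds since some fixed reference time


def recall(note: Note, now: float, max_age_s: float = 3600.0) -> tuple[str, bool]:
    """Return the note text and a flag saying whether it should be re-checked.

    The threshold is an assumed policy: any note older than max_age_s
    is treated as stale and flagged for verification against the world.
    """
    age = now - note.written_at
    needs_recheck = age > max_age_s
    return note.text, needs_recheck


note = Note(text="deploy server is healthy", written_at=0.0)
fresh_text, fresh_flag = recall(note, now=10.0)    # recent: can be used as-is
stale_text, stale_flag = recall(note, now=7200.0)  # two hours old: re-check first
```

The point of returning the flag rather than silently dropping old notes is that the agent still sees the note, but without the false confidence.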
A specialized subagent can also fail quietly if its output sounds plausible but no later layer verifies whether it is true.
This is why one-shot benchmark scores feel increasingly thin.
Two agents can reach the same final answer, while one burns far more tokens, makes riskier tool calls, carries corrupted memory, or succeeds only by accident.
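To make that comparison concrete, here is a rough sketch of scoring a whole run rather than only its final answer. The `Run` fields and the ordering used in `prefer` (correctness first, then fewer risky tool calls, then fewer tokens) are assumptions made for illustration, not a metric proposed in the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Summary of one agent run (hypothetical fields)."""
    correct: bool          # did the final answer match the reference?
    tokens_used: int       # total tokens consumed across the run
    risky_tool_calls: int  # tool calls that touched something irreversible


def prefer(a: Run, b: Run) -> Run:
    """Pick the better of two runs under an assumed ordering.

    Tuples compare element by element, so a correct run always beats an
    incorrect one; ties fall through to risk, then to token cost.
    """
    def key(r: Run) -> tuple[bool, int, int]:
        return (not r.correct, r.risky_tool_calls, r.tokens_used)

    return a if key(a) <= key(b) else b


careful = Run(correct=True, tokens_used=1_000, risky_tool_calls=0)
lucky = Run(correct=True, tokens_used=9_000, risky_tool_calls=3)
best = prefer(careful, lucky)  # same final answer, but the careful run wins
```

A one-shot benchmark would score `careful` and `lucky` identically, since both have `correct=True`.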
The next frontier is not just scaling the mind inside the machine.
It is scaling the discipline around it.
----
Link – arxiv.org/abs/2605.26112
Title: "From Model Scaling to System Scaling: Scaling the Harness in Agentic AI"