# Stronger agents will come not only from larger models, but from better systems around them

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-05-29 16:08
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpqncqj306juslnomu8xwd4d
- Original link: https://x.com/rohanpaul_ai/status/2060272036048695528

## AI Summary

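The post argues that an AI agent's strength depends not only on the model but also on the harness, the system around the model. That system determines the model's inputs, its available tools, its memory, and how its actions are verified. The main gains should come from scaling this system, especially by improving context control, memory trustworthiness, and routing to tools or subagents. The post stresses that long context is not the same as usable context, more memory is not the same as trustworthy memory, and more tools is not the same as knowing how to use them. This makes evaluation based solely on one-shot benchmark scores look thin. The next frontier is scaling the harness around the agent, not just the model itself. The related paper is titled "From Model Scaling to System Scaling: Scaling the Harness in Agentic AI".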

## Body

Stronger agents will come not only from larger models, but from better systems around them.

The problem is that many AI agents are judged as if the model alone did the work, even though their real behavior also depends on memory, tools, context, routing, checks, and permissions.

This setup around the agent is called the harness: the system that decides what the model sees, which tools it can use, what it remembers, and which actions get checked.
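To make the four responsibilities concrete, here is a minimal sketch of a harness as a wrapper object. Everything in it is an assumption made for illustration: the `Harness` class, its fields, and the `upper` tool are hypothetical names, not something taken from the post or the paper.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical harness: controls context, tools, memory, and checks."""
    allowed_tools: dict[str, Callable[[str], str]]
    memory: list[str] = field(default_factory=list)
    max_context_items: int = 3

    def build_context(self, task: str) -> list[str]:
        # What the model sees: the task plus only the most recent notes.
        start = max(0, len(self.memory) - self.max_context_items)
        return [task] + self.memory[start:]

    def call_tool(self, name: str, arg: str) -> str:
        # Which actions get checked: tools outside the allow-list are refused.
        if name not in self.allowed_tools:
            raise PermissionError(f"tool not permitted: {name}")
        result = self.allowed_tools[name](arg)
        # What it remembers: every tool call is recorded as a note.
        self.memory.append(f"{name}({arg}) -> {result}")
        return result


harness = Harness(allowed_tools={"upper": str.upper})
out = harness.call_tool("upper", "hello")
context = harness.build_context("summarize")
```

The model itself never appears in this sketch, which is the point: each of these decisions is made outside it.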

Progress should come from scaling this harness, especially three parts: better context control, more trustworthy memory, and better routing to tools or helper agents.

Long context is not the same as usable context, memory is not the same as trustworthy memory, and having many tools is not the same as knowing when to use them.

A stale note can be more dangerous than no note， because it gives the agent confidence exactly when it should re-check the world.
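One way a harness might guard against stale notes is to attach a timestamp to each note and return it with an explicit trust flag. The sketch below assumes a simple age threshold; the `Note` type, the `recall` function, and the 60-second limit are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Note:
    key: str
    value: str
    written_at: float  # seconds on whatever clock the harness uses


def recall(note: Note, now: float, max_age_s: float) -> tuple[str, bool]:
    """Return (value, trusted); a stale note comes back flagged as untrusted."""
    age = now - note.written_at
    return note.value, age <= max_age_s


note = Note("deploy_branch", "release-1.4", written_at=1_000.0)
fresh_value, fresh_ok = recall(note, now=1_030.0, max_age_s=60.0)
stale_value, stale_ok = recall(note, now=5_000.0, max_age_s=60.0)
```

The stale note is still returned, but the flag tells the agent to re-check the world before acting on it.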

A specialized subagent can also fail quietly if its output sounds plausible but no later layer verifies whether it is true.
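A later verification layer can be sketched as a wrapper that never returns a subagent's answer without an independent check. The subagent, the verifier, and the arithmetic task here are all hypothetical stand-ins; a real verifier would depend on the domain.

```python
from typing import Callable


def run_with_check(
    subagent: Callable[[str], str],
    verify: Callable[[str, str], bool],
    task: str,
) -> dict:
    """Run the subagent, then record whether an independent check agrees."""
    answer = subagent(task)
    return {"answer": answer, "verified": verify(task, answer)}


def summing_subagent(task: str) -> str:
    # Hypothetical helper that returns a plausible-looking but wrong sum.
    return "41"


def verify_sum(task: str, answer: str) -> bool:
    # Recompute the sum directly instead of trusting the subagent's output.
    numbers = [int(tok) for tok in task.split("+")]
    return answer.strip().isdigit() and int(answer) == sum(numbers)


result = run_with_check(summing_subagent, verify_sum, "20+22")
```

Here the answer looks reasonable on its face, and only the recomputation reveals that it is wrong.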

This is why one-shot benchmark scores feel increasingly thin.

Two agents can reach the same final answer, while one burns far more tokens, makes riskier tool calls, carries corrupted memory, or succeeds only by accident.
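To illustrate why a final-answer score hides these differences, the sketch below reports a few trajectory-level fields alongside correctness. The `Trajectory` fields and all the numbers are made up for illustration; they are not metrics or results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    final_answer: str
    tokens_used: int
    risky_tool_calls: int
    unverified_steps: int


def report(t: Trajectory, expected: str) -> dict:
    """Report correctness together with how the answer was reached."""
    return {
        "correct": t.final_answer == expected,
        "tokens_used": t.tokens_used,
        "risky_tool_calls": t.risky_tool_calls,
        "unverified_steps": t.unverified_steps,
    }


# Two hypothetical runs that a one-shot benchmark would score identically.
careful = Trajectory("42", tokens_used=1_200, risky_tool_calls=0, unverified_steps=0)
lucky = Trajectory("42", tokens_used=9_500, risky_tool_calls=3, unverified_steps=4)

careful_report = report(careful, expected="42")
lucky_report = report(lucky, expected="42")
```

Both runs are marked correct, yet the reports differ on every other field.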

The next frontier is not just scaling the mind inside the machine.

It is scaling the discipline around it.

----

Link: arxiv.org/abs/2605.26112

Title: "From Model Scaling to System Scaling: Scaling the Harness in Agentic AI"