# 斯坦福研究：在同等推理预算下，单智能体LLM通常优于多智能体系统处理多跳问题

- 来源：Rohan Paul (@rohanpaul_ai)
- 发布时间：2026-05-17 17:08
- AIHOT 分数：57
- AIHOT 链接：https://aihot.virxact.com/items/cmp9k1p7y0ri0slnza7wwqi08
- 原文链接：https://x.com/rohanpaul_ai/status/2055938373270081642

## AI 摘要

斯坦福论文论证，在相等推理令牌预算下，单个LLM解决多跳问题通常比多代理系统更有效。核心在于单代理能保持完整的内部思维链，而多代理需将思维分割为消息传递与交接，每次交接都压缩信息并导致丢失，这以数据处理不等式为形式化解释。实验在多个模型和数据集上验证，预算匹配时单代理表现等同或优于多种多代理设置。多代理的常见增益可能源于额外计算或评估偏差，而非架构优势。论文建议，多跳推理应默认从强单代理开始，仅当单代理上下文受干扰退化时，才将多代理结构作为修复策略使用。

## 正文

New Stanford paper argues that， under equal reasoning budgets， one LLM usually solves multi-hop problems better than many coordinated ones.

The core point is almost embarrassingly simple.

A single agent keeps the whole problem in one internal chain of thought， while a multi-agent system has to slice that chain into messages， summaries， and handoffs.

Every handoff is a compression step.

And once reasoning is compressed， some information is easier to drop than to recover， which is why the paper leans on the Data Processing Inequality as a formal explanation rather than just an empirical hunch.

The experiments back that up across Qwen， DeepSeek， and Gemini on FRAMES and MuSiQue： when thinking-token budgets are matched， single-agent systems usually match or beat sequential， debate， role-based， and ensemble setups.

Here's the part most people miss.

Many celebrated multi-agent gains may not be architectural gains at all. They often come from spending more test-time compute， surfacing more visible reasoning， or benefiting from evaluation quirks that make the pipeline look smarter than it is.

The paper is especially sharp when it looks for the boundary case instead of pretending the rule is universal.

When the single agent's effective context is degraded by masking， substitution， or misleading distractors， multi-agent pipelines become more competitive and sometimes win， not because message passing is magical， but because structure can partially stabilize corrupted reasoning.

That is a much narrower and more useful claim than "more agents is better."

It suggests the real trade-off is not single versus multi so much as latent reasoning versus external coordination， with context quality and compute accounting deciding which side looks stronger.

For multi-hop reasoning， the default should now be clear： start with one strong model， and treat extra agents as a repair strategy， not an upgrade.

----

Paper Link - arxiv. org/abs/2604.02460

Paper Title： "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets"