# AI search agents often confirm what they already know instead of actually researching the web

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-05-31 15:48
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmptibjj500amsl0ej5o7uukz
- Original link: https://the-decoder.com/ai-search-agents-often-confirm-what-they-already-know-instead-of-actually-researching-the-web

## AI Summary

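Researchers at the Harbin Institute of Technology found that leading AI search agents, including GPT-5.4 and Kimi K2.6, do little genuine web research on established benchmarks. They mostly use the web to confirm knowledge acquired during training. The team reached this conclusion with a new benchmark called LiveBrowseComp, which only covers events from the past 90 days. When models cannot fall back on memory, their performance drops sharply and existing performance rankings change.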

## Full text

AI search agents often confirm what they already know instead of actually researching the web

A new study suggests that leading AI search agents do little actual research on established benchmarks; they mostly use the web to confirm answers they already have. Once models have to go beyond their existing knowledge, search performance falls apart.

Frontier models like GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi-K2.6 keep posting higher scores on BrowseComp. The benchmark asks agents complex questions that can only be answered through multi-step browsing and piecing together information from different web sources.

Researchers from the Harbin Institute of Technology and Xiaohongshu have now shown in a study that these results say less about the agents' research skills than assumed. The authors call the effect "intrinsic knowledge dependence" (IKD), a reliance on internal knowledge the models absorbed during training.

The researchers tested eleven models in total, first stripping away all search and browsing tools. Even without internet access, the models scored surprisingly high. MiniMax M2.5 solved 44.5 percent of BrowseComp tasks from memory alone. Kimi K2.6 hit 62 percent on the Chinese BrowseComp-ZH variant. In other words, a large share of benchmark performance is already in place before any search happens.

### Searching can actually hurt the answer

The second test is more telling. The researchers left the search interface in place but removed all answer-supporting documents from the search index. Every model tested then performed worse than it did without any tool access at all. MiniMax M2.5 dropped from 44.5 to 8.0 percent. Kimi K2.6 fell from 25.5 to 2.3 percent. When no confirming hits show up, searching actively pulls agents away from answers they would have gotten right from memory.
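The size of that penalty is easy to work out from the figures quoted above. The following is a minimal sketch, not the study's code: the dictionary layout and the `search_penalty` helper are assumptions made for illustration, and only the four percentages come from the article.

```python
# Accuracy in percent, as quoted in the article.
# The keys and structure are illustrative assumptions.
REPORTED = {
    "MiniMax M2.5": {"closed_book": 44.5, "search_no_evidence": 8.0},
    "Kimi K2.6": {"closed_book": 25.5, "search_no_evidence": 2.3},
}


def search_penalty(scores: dict[str, float]) -> float:
    """Percentage points lost when a search tool is present
    but the index holds no answer-supporting documents."""
    return round(scores["closed_book"] - scores["search_no_evidence"], 1)


penalties = {model: search_penalty(s) for model, s in REPORTED.items()}

for model, points in penalties.items():
    print(f"{model}: {points} percentage points lost")
```

A positive value means the model did better with no tools at all than with a search tool that could not confirm its answer.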

An analysis of the search paths explains why. More than half of all queries come from the model's own reasoning rather than from previously found hits. Even when relevant evidence does appear in search results, the agents fold it into their reasoning less than a third of the time. The loop is model-led, not evidence-led.
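To make the "model-led versus evidence-led" distinction concrete, here is a toy sketch. The article does not say how the authors labeled queries, so the term-overlap heuristic below is purely an assumption: a query counts as evidence-led if it uses a term that appeared in earlier search results but not in the original question. The function names and the sample trajectory are invented for illustration.

```python
def tokens(text: str) -> set[str]:
    """Lowercased words longer than three characters, punctuation stripped."""
    return {w.strip(".,?!\"'").lower() for w in text.split() if len(w) > 3}


def model_led_share(question: str, steps: list[tuple[str, str]]) -> float:
    """Fraction of queries that reuse no new term from earlier results.

    steps: (query, text of the results returned for that query), in order.
    Hypothetical heuristic, not the method used in the study.
    """
    if not steps:
        return 0.0
    known = tokens(question)      # terms the agent was given up front
    seen: set[str] = set()        # terms surfaced by earlier search results
    model_led = 0
    for query, results in steps:
        new_terms = tokens(query) - known
        if not (new_terms & seen):
            model_led += 1        # nothing in this query came from evidence
        seen |= tokens(results)
    return model_led / len(steps)


# Invented trajectory, for illustration only.
question = "Which studio released the game that won the award?"
steps = [
    ("studio award winning game", "The award went to Starfall by Northwind Games"),
    ("Northwind Games founder", "Founded in 2011 by two former engineers"),
    ("Bluebird Interactive release date", "No matching results"),
]
share = model_led_share(question, steps)
```

In this toy trajectory only the second query builds on a retrieved hit, so two of the three queries count as model-led.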

### A benchmark beyond the knowledge frontier

To measure real search behavior, the authors built LiveBrowseComp. The benchmark contains 335 human-written questions, each depending on at least one fact from the 90 days before creation and impossible to answer without that current information.
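The recency rule can be sketched as a simple date check. This is a minimal illustration under stated assumptions: the `Question` record and the `is_fresh` helper are hypothetical, and the check covers only the 90-day condition, not the requirement that a question be unanswerable without the recent fact.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW = timedelta(days=90)  # the 90-day window described in the article


@dataclass
class Question:
    """Hypothetical record layout, not the benchmark's actual schema."""
    text: str
    created: date
    fact_dates: list[date]  # dates of the facts the answer depends on


def is_fresh(q: Question) -> bool:
    """True if at least one supporting fact falls within the
    90 days before the question was created."""
    return any(timedelta(0) <= q.created - d <= WINDOW for d in q.fact_dates)


# Invented examples, for illustration only.
recent = Question("example A", date(2026, 5, 1), [date(2025, 1, 10), date(2026, 3, 15)])
stale = Question("example B", date(2026, 5, 1), [date(2025, 11, 1)])
```

Here `recent` passes because one of its facts is 47 days old at creation time, while the only fact behind `stale` is about six months old.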

The underlying events come from constantly updated sources like film databases, game directories, security vulnerability registers, and earthquake catalogs. Globally prominent events are filtered out deliberately, leaving obscure but publicly verifiable facts that had little chance of seeping into model parameters during training.

Human testers need about the same amount of time for LiveBrowseComp as for BrowseComp and solve a similar number of tasks. The performance drop among models is therefore due to the loss of the memory shortcut, not to harder questions.

### Leaderboard rankings fall apart

On LiveBrowseComp, all models in the closed-book test fall below two percent accuracy. With tools turned on, scores land about 25 to 40 percentage points below the same models' BrowseComp results.

This shifts the rankings. GLM 5.1 leads clearly among open-source models on BrowseComp but falls to mid-pack on LiveBrowseComp. DeepSeek v3.2 sat at the bottom on BrowseComp, then climbed to the top on LiveBrowseComp, passing several models that previously outperformed it. In other words, a model's spot on a static leaderboard mostly reflects how much it already knows, not how well it searches.

### Agents need more steps when they can't rely on memory

On BrowseComp, agents solve many questions in very few steps, a sign of quick memory confirmation. On LiveBrowseComp, that pattern disappears. The step counts shift much higher, which suggests the agents are doing real research instead of recalling stored knowledge.

The authors argue that dynamic, time-sensitive benchmarks should become the standard for evaluating AI agents. They also want training signals that reward evidence-based research over the typical guess-and-verify approach.

Other studies have flagged similar problems. A benchmark from Peking University found that top models often produce the right answer when analyzing documents but cite the wrong source, a failure the researchers call "attribution hallucination." A tool called CiteAudit recently discovered that fabricated references have already made it into accepted papers at major AI conferences. The reason: commercial models don't reliably catch made-up citations.

