# LiveBrowseComp: Are Search Agents Really Searching, or Verifying What They Already Know?

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpozt6xb0alvslv4cfd99kqh
- Original link: https://arxiv.org/abs/2605.28721

## AI Summary

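The study finds that LLM-based search agents exhibit "Intrinsic Knowledge Dependence": on the BrowseComp benchmark, agents can answer up to 44.5% of questions without any tools, more than half of their search queries stem from the model's internal hypotheses rather than retrieved leads, and when supporting evidence is removed they perform even worse than closed-book baselines. This suggests that static benchmarks may be rewarding memory-backed verification. In response, the study introduces LiveBrowseComp, a deep-search benchmark of 335 human-authored questions that depend on facts published within the 90 days preceding benchmark construction. On LiveBrowseComp, every agent's closed-book accuracy falls below 2%, search-augmented scores drop substantially, and prior model rankings are no longer reliable.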

## Main Text

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
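To make two of the quantities in the abstract concrete, here is a minimal Python sketch of a 90-day freshness check and a closed-book accuracy calculation. It is an illustration only, not the authors' pipeline: the function names, the inclusive window boundaries, and the example dates are all assumptions, and only the 90-day figure comes from the abstract.

```python
from datetime import date, timedelta

# The 90-day window is taken from the abstract; everything else is hypothetical.
WINDOW_DAYS = 90


def within_freshness_window(published: date, construction: date,
                            window_days: int = WINDOW_DAYS) -> bool:
    """Return True if `published` falls in the `window_days` days before
    `construction`. Treating both ends as inclusive is an assumption."""
    earliest = construction - timedelta(days=window_days)
    return earliest <= published <= construction


def closed_book_accuracy(correct_flags: list[bool]) -> float:
    """Fraction of questions answered correctly without any tool access.
    `correct_flags` holds one boolean per question (hypothetical format)."""
    if not correct_flags:
        return 0.0
    return sum(correct_flags) / len(correct_flags)


if __name__ == "__main__":
    # Hypothetical construction date, chosen only for illustration.
    construction_date = date(2026, 5, 1)
    print(within_freshness_window(date(2026, 3, 1), construction_date))
    print(within_freshness_window(date(2026, 1, 1), construction_date))
    print(closed_book_accuracy([True, False, False, False]))
```

A filter like this only checks recency. The benchmark described above also excludes globally salient events, which presumably requires human judgment and is not modeled here.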
