# SWE-Explore: A Benchmark for Evaluating the Repository Exploration Ability of Coding Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-05 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmq61s9bh04cxsl5i7d4vwyst
- Original link: https://arxiv.org/abs/2606.07297

## AI Summary

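SWE-Explore is a benchmark designed specifically to evaluate the repository exploration ability of coding agents. It covers 848 issues, 10 programming languages, and 203 open-source repositories. Each task requires the explorer to return a ranked list of relevant code regions within a fixed line budget; the ground truth comes from independent agent trajectories that successfully resolved the same issue. Evaluation spans three dimensions (coverage, ranking, and context efficiency), and these metrics are found to correlate strongly with downstream repair behavior. The results show that agentic explorers clearly outperform classical retrieval methods overall. Because file-level localization is already strong, however, line-level coverage and efficient ranking are what separate frontier explorers from one another.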

## Main Text

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
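To make the task format concrete, the following is a minimal sketch of how a ranked list of code regions might be scored against line-level ground truth under a fixed line budget. The abstract does not give the benchmark's actual metric definitions, so everything here is an assumption made for illustration: the `Region` type, the truncation rule in `apply_budget`, and the four metric formulas in `score` (line-level recall, file-level recall, the fraction of returned lines that are relevant, and the reciprocal rank of the first relevant region).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical code region: a contiguous line span in one file.

    Lines are 1-indexed and inclusive; start <= end is assumed.
    """
    path: str
    start: int
    end: int

    def lines(self) -> set[tuple[str, int]]:
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def apply_budget(ranked: list[Region], budget: int) -> list[Region]:
    """Keep regions in rank order until the line budget is spent.

    Assumption: the region that crosses the budget is truncated rather
    than dropped. Overlapping regions are charged for every line.
    """
    kept: list[Region] = []
    used = 0
    for region in ranked:
        remaining = budget - used
        if remaining <= 0:
            break
        take = min(region.end - region.start + 1, remaining)
        kept.append(Region(region.path, region.start, region.start + take - 1))
        used += take
    return kept


def score(ranked: list[Region], gold: list[Region], budget: int) -> dict[str, float]:
    """Illustrative coverage, ranking, and efficiency scores (not the paper's)."""
    kept = apply_budget(ranked, budget)
    returned = set().union(*(r.lines() for r in kept))
    gold_lines = set().union(*(g.lines() for g in gold))
    hits = returned & gold_lines

    gold_files = {path for path, _ in gold_lines}
    hit_files = gold_files & {path for path, _ in returned}

    # Rank (1-based) of the first kept region that touches any gold line.
    first_hit_rank = next(
        (i for i, r in enumerate(kept, start=1) if r.lines() & gold_lines), None
    )
    return {
        "line_coverage": len(hits) / len(gold_lines) if gold_lines else 0.0,
        "file_coverage": len(hit_files) / len(gold_files) if gold_files else 0.0,
        "context_efficiency": len(hits) / len(returned) if returned else 0.0,
        "reciprocal_rank": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }


if __name__ == "__main__":
    ranked = [Region("a.py", 10, 19), Region("b.py", 1, 50), Region("a.py", 100, 109)]
    gold = [Region("a.py", 15, 24), Region("c.py", 5, 9)]
    for name, value in score(ranked, gold, budget=30).items():
        print(f"{name}: {value:.3f}")
```

In this toy input the explorer reaches one of the two gold files but only a third of the gold lines, and most of its budget goes to an irrelevant file. That gap between file-level and line-level scores is the kind of difference the abstract says distinguishes state-of-the-art explorers.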
