AI coding agents find the right file but miss the exact lines that matter, study shows
A new benchmark separates code search from the actual fix and exposes a hidden weakness of AI coding agents. They land in the right neighborhood but miss the crucial spots.
Until now, AI coding has mostly been judged by the result. Did the agent fix the bug or not? That single metric hides what actually went wrong. Maybe the agent never read the relevant code. Maybe it saw the correct file and still wrote the wrong patch. Either way, the outcome looks the same.
An international research team involving Shanghai Jiao Tong University is tackling this blind spot with SWE-Explore. The benchmark only evaluates the first phase of the process. An agent receives a bug description and a software project, then returns a ranked list of code sections it considers relevant.
Successful runs set the reference
Figuring out which sections truly matter is nearly impossible to do by hand. So the team takes a different approach. For each of the 848 problems in the dataset, at least two successful solution attempts exist from powerful models like GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, or Kimi K2.6.
From these runs, the researchers extract which files and lines the models actually examined before fixing the bug. Passages that multiple independent solution paths converge on count as a signal of useful context. They're not strictly required, but strongly indicated. A separate verification step fills in individual key passages, and the team then manually reviews each region again.
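The convergence idea can be sketched in a few lines of Python. This is only an illustration, not the researchers' pipeline: the function name `consensus_lines`, the `min_runs` threshold of two, and the toy file paths are all assumptions made for the example.

```python
from collections import Counter


def consensus_lines(runs, min_runs=2):
    """Return the (file, line) pairs examined in at least `min_runs` runs.

    `runs` is a list of collections of (file, line) tuples, one per
    successful solution attempt. The threshold is an assumed value.
    """
    counts = Counter()
    for run in runs:
        # Deduplicate within a run so a line read twice still counts once.
        counts.update(set(run))
    return {loc for loc, n in counts.items() if n >= min_runs}


# Hypothetical traces from three successful runs on the same bug.
run_a = {("core/math.py", 10), ("core/math.py", 11), ("README.md", 3)}
run_b = {("core/math.py", 11), ("core/math.py", 12), ("tests/test_math.py", 5)}
run_c = {("core/math.py", 10), ("core/math.py", 11)}

reference = consensus_lines([run_a, run_b, run_c])
print(sorted(reference))
```

Lines that only one run happened to read, such as the README entry here, drop out, while lines several runs share survive as reference context.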
The dataset draws from 203 open-source projects across ten programming languages. Python dominates with 547 of 848 tasks, followed by Go, JavaScript, and Rust.
Keyword search barely beats chance
The comparison pits traditional search methods against five general-purpose coding agents, including Claude Code, Codex, and OpenHands, along with four research systems built specifically for code search.
Old-school keyword search barely beats chance. In a case study, the authors show why. A bug description like "RuntimeWarning on Overflow" contains terms that show up far more often in a project's templates and docs than in the actual source code. AI agents pull ahead clearly because they search the project step by step instead of sorting all hits at once.
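A toy example shows how that failure mode arises. The sketch below is a bare term-count ranker, not the keyword baseline used in the study, and the three-file project is invented for illustration.

```python
import re
from collections import Counter


def keyword_rank(query, files):
    """Rank file paths by how often the query's terms appear in each file."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    scores = {}
    for path, text in files.items():
        words = Counter(re.findall(r"[a-z]+", text.lower()))
        scores[path] = sum(words[t] for t in terms)
    # Highest score first; ties broken alphabetically by path.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical project: the docs and the issue template repeat the wording
# of the bug report, while the code that needs fixing never mentions it.
project = {
    "docs/warnings.md": (
        "RuntimeWarning on overflow: how to silence a RuntimeWarning "
        "when overflow occurs."
    ),
    "templates/bug_report.md": (
        "Describe the warning. Example: RuntimeWarning on overflow."
    ),
    "src/reduce.py": (
        "def safe_sum(values):\n    return accumulate(values, check=True)"
    ),
}

ranking = keyword_rank("RuntimeWarning on Overflow", project)
print(ranking)
```

The source file ends up last because it shares no vocabulary with the bug report, which is the pattern the case study describes.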
Line-level accuracy drops off a cliff
At the file level, the agents do fine. They find the right source file, rank it early, and keep the selection tight. But the moment the test zooms in to individual lines of code, performance falls apart. General coding agents cover only 14 to 19 percent of the lines that actually matter.
Throwing a stronger language model at the problem doesn't fix it. The team ran the same agent with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. The GPT family leads, but the pattern holds. File hit rates stay consistently higher than actual line coverage.
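The gap between the two measurements is easy to reproduce with simple set arithmetic. The definitions below are assumptions chosen for illustration, since the benchmark's exact metric formulas are not given here, and the file name and line numbers are made up.

```python
def file_hit_rate(predicted, reference):
    """Share of reference files that appear anywhere in the prediction."""
    ref_files = {f for f, _ in reference}
    pred_files = {f for f, _ in predicted}
    if not ref_files:
        raise ValueError("reference must not be empty")
    return len(ref_files & pred_files) / len(ref_files)


def line_coverage(predicted, reference):
    """Share of reference (file, line) pairs that the prediction contains."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference must not be empty")
    return len(set(predicted) & reference) / len(reference)


# Hypothetical case: ten lines matter, and the agent returns the right file
# but a region that only overlaps two of them.
reference = {("core/math.py", n) for n in range(40, 50)}
predicted = {("core/math.py", n) for n in range(10, 42)}

print(file_hit_rate(predicted, reference))  # 1.0
print(line_coverage(predicted, reference))  # 0.2
```

Under these definitions, an agent can score perfectly at the file level while covering only a fifth of the lines that matter.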