# AI coding agents find the right file but miss the exact lines that matter, study shows

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-06-14 16:54
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmqdkf44d008nslm7dze6rmcu
- Original link: https://the-decoder.com/ai-coding-agents-find-the-right-file-but-miss-the-exact-lines-that-matter-study-shows

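## AI Summary

The AI coding agents Claude Code and Codex reliably find the right file but miss most of the key lines of code inside it. The new SWE-Explore benchmark is the first to test code search separately from the actual fix, and it shows that without enough context even the best repair approach fails.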

## Article

A new benchmark separates code search from the actual fix and exposes a hidden weakness of AI coding agents. They land in the right neighborhood but miss the crucial spots.

Until now, AI coding has mostly been judged by the result. Did the agent fix the bug or not? That single metric hides what actually went wrong. Maybe the agent never read the relevant code. Maybe it saw the correct file and still wrote the wrong patch. Either way, the outcome looks the same.

An international research team involving Shanghai Jiao Tong University is tackling this blind spot with SWE-Explore. The benchmark only evaluates the first phase of the process. An agent receives a bug description and a software project, then returns a ranked list of code sections it considers relevant.

### Successful runs set the reference

Figuring out which sections truly matter is nearly impossible to do by hand. So the team takes a different approach. For each of the 848 problems in the dataset, at least two successful solution attempts exist from powerful models like GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, or Kimi K2.6.

From these runs, the researchers extract which files and lines the AI actually examined before fixing the bug. Passages that multiple independent solution paths converge on count as a signal of useful context. They're not strictly required, but strongly indicated. A separate verification step fills in individual key passages, and the team then manually reviews each region again.
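The convergence idea can be illustrated with a minimal sketch. It is not the researchers' pipeline: the input format, the function name `consensus_lines`, and the `min_runs` threshold are assumptions made for illustration. Each successful run contributes the line ranges it viewed, and a line counts as useful context once enough independent runs have looked at it.

```python
from collections import Counter


def consensus_lines(runs, min_runs=2):
    """Return {file: sorted lines} that at least `min_runs` runs examined.

    `runs` is a list of dicts mapping a file path to a list of inclusive
    (start, end) line ranges that one successful run viewed.
    This input format is hypothetical, not the benchmark's own.
    """
    votes = Counter()
    for run in runs:
        seen = set()
        for path, ranges in run.items():
            for start, end in ranges:
                seen.update((path, n) for n in range(start, end + 1))
        votes.update(seen)  # each run votes at most once per line

    result = {}
    for (path, line), count in votes.items():
        if count >= min_runs:
            result.setdefault(path, []).append(line)
    return {path: sorted(lines) for path, lines in result.items()}


# Two invented runs: both read calc.py, only one opened util.py.
runs = [
    {"calc.py": [(10, 20)], "util.py": [(1, 5)]},
    {"calc.py": [(15, 30)]},
]
print(consensus_lines(runs))  # {'calc.py': [15, 16, 17, 18, 19, 20]}
```

Only the overlap of the two runs survives, which mirrors the article's point that such passages are strongly indicated rather than strictly required.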

The dataset draws from 203 open-source projects across ten programming languages. Python dominates with 547 of 848 tasks, followed by Go, JavaScript, and Rust.

### Keyword search barely beats chance

The comparison pits traditional search methods against five general-purpose coding agents, including Claude Code, Codex, and OpenHands, along with four research systems built specifically for code search.

Old-school keyword search barely beats chance. In a case study, the authors show why. A bug description like "RuntimeWarning on Overflow" contains terms that show up far more often in a project's templates and docs than in the actual source code. AI agents pull ahead clearly because they search the project step by step instead of sorting all hits at once.
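A toy example shows how this failure mode can arise. The scoring function below is a plain term count, not any of the search methods tested in the study, and the file names and contents are invented. A documentation page that repeats the words of the bug report outranks the source file where the arithmetic actually happens.

```python
import re


def keyword_score(query, text):
    """Count how often the query's words occur in `text` (case-insensitive)."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in terms)


def rank_files(query, files):
    """Sort file paths by descending keyword score."""
    return sorted(files, key=lambda path: keyword_score(query, files[path]), reverse=True)


# Hypothetical project: the docs talk about the warning, the code does not.
files = {
    "src/arith.py": "def add(a, b):\n    return checked(a + b)\n",
    "docs/warnings.md": "RuntimeWarning on overflow: overflow warnings are listed here. Overflow FAQ.",
}
ranking = rank_files("RuntimeWarning on Overflow", files)
print(ranking)  # ['docs/warnings.md', 'src/arith.py']
```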

### Line-level accuracy drops off a cliff

At the file level, the agents do fine. They find the right source file, rank it early, and keep the selection tight. But the moment the test zooms in to individual lines of code, their performance falls apart. General coding agents cover only 14 to 19 percent of the lines that actually matter.
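The gap between the two levels is easy to see in a small sketch. The metric definitions here are simplified assumptions, not the benchmark's exact formulas, and the regions are invented: a prediction can hit the right file while touching only a few of the lines that matter.

```python
def file_hit(predicted, gold):
    """True if any predicted region lies in a file that contains gold lines."""
    gold_files = {path for path, _ in gold}
    return any(path in gold_files for path, _, _ in predicted)


def line_coverage(predicted, gold):
    """Fraction of gold (path, line) pairs covered by the predicted ranges."""
    if not gold:
        return 0.0
    covered = {
        (path, n)
        for path, start, end in predicted
        for n in range(start, end + 1)
    }
    return len(covered & gold) / len(gold)


# Invented example: 20 relevant lines, the agent returns a 5-line window.
gold = {("calc.py", n) for n in range(40, 60)}
predicted = [("calc.py", 38, 42)]  # (path, start, end), inclusive

print(file_hit(predicted, gold))       # True
print(line_coverage(predicted, gold))  # 0.15
```

The file-level check passes, yet only 3 of the 20 relevant lines are covered.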

Throwing a stronger language model at the problem doesn't fix it. The team ran the same agent with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. The GPT family leads, but the pattern holds. File hit rates stay consistently higher than actual line coverage.

The various agent architectures land strikingly close to each other. Claude Code, Codex, OpenHands, Mini-SWE-Agent, and AweAgent post nearly identical scores across every metric.

The CoSIL research system is the outlier. It scans code as a network of interconnected building blocks and achieves much higher line coverage. Among the specialized localization systems, AutoCodeRover works precisely but stays conservative, while OrcaLoca produces little noise but misses many relevant spots.

### Repairs fail below a minimum context threshold

In a controlled ablation experiment, the team artificially varied the context. The repair model saw only 0, 25, 50, 75, or 100 percent of the core regions, sometimes padded with irrelevant non-core code. For the easier tasks in the dataset, a clear threshold effect shows up. As long as fewer than half of the necessary core regions are visible, repairs mostly fail.

The success rate jumps only once coverage rises from 50 to 75 percent. Fixes don't improve gradually. They need a minimum amount of clues before anything clicks. For harder tasks, the effect is much weaker. If the problem already exceeds the model's ability, even better context doesn't help much.
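How such an ablation might assemble its inputs can be sketched as follows. This is an illustration under stated assumptions, not the team's code: `build_context`, its parameters, and the region labels are hypothetical. A fixed share of the core regions is sampled and optionally padded with distractor regions before the context is handed to a repair model.

```python
import random


def build_context(core, noise, fraction, n_noise=0, seed=0):
    """Pick round(fraction * len(core)) core regions plus up to `n_noise` distractors."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    k = round(fraction * len(core))
    chosen = rng.sample(core, k)
    padding = rng.sample(noise, min(n_noise, len(noise)))
    context = chosen + padding
    rng.shuffle(context)  # avoid leaking which regions are core via ordering
    return context


core = ["core_1", "core_2", "core_3", "core_4"]  # invented region labels
noise = ["noise_1", "noise_2", "noise_3"]

for fraction in (0, 0.25, 0.5, 0.75, 1.0):
    context = build_context(core, noise, fraction, n_noise=2)
    visible = sum(region in core for region in context)
    print(f"{fraction:.0%} coverage: {visible} core regions, {len(context) - visible} distractors")
```

Running the repair step on each of these contexts and plotting success against coverage is what would expose a threshold like the one the article describes.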

Once the critical spots are available, irrelevant extra code barely gets in the way. An agent that reads too little does worse than one that reads too much. The takeaway for future improvements is clear: Filter less, read more. Code and data are available on GitHub and Hugging Face.

About two years ago, a research group created SWE-bench, a benchmark that tests AI coding agents against real GitHub issue reports. That spawned a whole family of variants covering more languages, cleaner data, and harder professional tasks. Lately, though, the underlying success metric has come under pressure from several directions. A study by the research organization METR found that project managers would reject about half the solutions the automated reviewer accepted, many of them because of basic functional errors.

