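Meta researchers found that forcing a large language model (LLM) to follow a checklist and show step-by-step proof of its reasoning when analyzing code cuts its code patch error rate by nearly 50%. The common errors come from the model recognizing a familiar name (such as "format") too early and applying its usual meaning instead of checking the project's files, so it relies on a confident guess rather than real analysis. Requiring the model to write out exactly what changed, trace the execution path, and back its conclusion with concrete evidence forces it to read the local files and follow the real logic, which raises accuracy to 93%. The method needs no expensive new training or complex systems: basic structured prompting is enough for highly reliable code verification, and it saves the large compute cost of running software tests.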
"Can LLM agents explore codebases and reason about code semantics without executing the code?"
Meta discovered that if you force an LLM to show its reasoning step by step with proof, its code patch error rate drops by nearly 50%.
The finding is not that models suddenly became deeper thinkers.
It is that many code errors come from premature recognition: the model sees a familiar name, such as "format", and quietly substitutes the usual meaning before checking how the project's own files define it.
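To make the failure mode concrete, here is a minimal Python sketch of how a familiar name can mislead. The project-local `format` helper and both `render_*` functions are hypothetical, invented for illustration; they are not taken from the Meta work.

```python
import builtins


# Hypothetical project helper: it shadows the builtin `format` and
# zero-pads to six characters, ignoring the spec argument entirely.
def format(value, spec=""):
    return str(value).zfill(6)


def render_with_builtin(value):
    # What a reader assumes `format` means: Python's builtin.
    return builtins.format(value, "d")


def render_with_local(value):
    # What the project actually runs: the local helper defined above.
    return format(value, "d")


if __name__ == "__main__":
    print(render_with_builtin(42))  # 42
    print(render_with_local(42))    # 000042
```

Two patches that differ only in which `format` they reach look identical by name, yet they produce different output. Only reading the local definition reveals the difference.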
If you simply ask a standard LLM to check the code without running it, the model usually glances at the function names and makes a confident guess.
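As a rough sketch of what a checklist-style prompt could look like, the snippet below assembles one from a patch and a question. The step wording, `CHECKLIST`, and `build_review_prompt` are assumptions made for illustration, not the template used in the Meta work.

```python
# Illustrative only: these steps are an assumption, not the published template.
CHECKLIST = (
    "1. State exactly what the patch changes (files, functions, lines).",
    "2. For every name you rely on, quote its definition from the project files.",
    "3. Trace the execution path through the changed code, step by step.",
    "4. Give a verdict and cite the specific evidence from steps 1-3.",
)


def build_review_prompt(patch: str, question: str) -> str:
    """Return a prompt that asks for evidence before any conclusion."""
    steps = "\n".join(CHECKLIST)
    return (
        "Answer without running the code.\n"
        f"Question: {question}\n\n"
        f"Patch:\n{patch}\n\n"
        "Fill in every step before concluding:\n"
        f"{steps}\n"
    )


if __name__ == "__main__":
    print(build_review_prompt("<diff goes here>", "Are these two patches equivalent?"))
```

The point of the design is ordering: the verdict comes last, so the model has to produce the definitions and the trace before it can commit to an answer.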