The Enigma of Artificial Reasoning: Investigating the Production-Evaluation Gap in Large Reasoning Models

2026-05-31 08:00

AI summary

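On VAIR problems, humans are only 6% worse at grading reasoning than at solving, whereas large reasoning models (LRMs) show a substantial production-evaluation gap. Tests on the VAIR dataset (math problems whose solutions contain trivial reasoning flaws but reach valid answers) show frontier LRMs scoring as low as 48% when evaluating solutions, despite near-perfect solution production. Chain-of-thought analysis points to an answer confirmation bias: LRMs often produce the answer first and then check for it rather than verifying each step, and they fabricate rationalizations even when they notice anomalous reasoning. Linear probes and causal patching experiments indicate that answer validity drives the models' verdicts, which points to an outstanding limitation of dominant reasoning-training approaches in building robust evaluation ability.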

Original abstract

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.
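The production-evaluation gap described in the abstract can be illustrated as the difference between two accuracies. The sketch below is a minimal illustration, not the authors' evaluation code: the `Record` fields, the function names, and the toy records are all assumptions made for the example, and the numbers are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Record:
    """One problem, with hypothetical fields chosen for this sketch."""
    solved_correctly: bool  # the model's own answer matched the reference
    gold_valid: bool        # whether the shown solution's reasoning is valid
    verdict_valid: bool     # the model's verdict on the shown solution


def production_accuracy(records: Sequence[Record]) -> float:
    """Fraction of problems the model solves correctly on its own."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(r.solved_correctly for r in records) / len(records)


def evaluation_accuracy(records: Sequence[Record]) -> float:
    """Fraction of shown solutions whose validity the model judges correctly."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(r.verdict_valid == r.gold_valid for r in records) / len(records)


def production_evaluation_gap(records: Sequence[Record]) -> float:
    """Production accuracy minus evaluation accuracy."""
    return production_accuracy(records) - evaluation_accuracy(records)


# Toy data only: every shown solution has flawed reasoning (gold_valid=False),
# mimicking the valid-answer-invalid-reasoning setup.
toy = [
    Record(solved_correctly=True, gold_valid=False, verdict_valid=True),
    Record(solved_correctly=True, gold_valid=False, verdict_valid=False),
    Record(solved_correctly=True, gold_valid=False, verdict_valid=True),
    Record(solved_correctly=True, gold_valid=False, verdict_valid=False),
]
gap = production_evaluation_gap(toy)
```

In this toy case the model solves every problem but accepts half of the flawed solutions as valid, so the gap is large even though production is perfect.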

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
