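A new paper reveals a "production-evaluation gap" in large reasoning models: a model can solve a math problem and get the right answer, yet when it evaluates someone else's reasoning it tends to pass the solution as long as the final answer is correct, even when the logic has obvious flaws such as missing steps, inverted premises, or circular argument. The authors propose the VAIR (Valid-Answer-Invalid-Reasoning) benchmark to test for the problem. They call the phenomenon "answer confirmation bias": the model judges reasoning by the correct answer rather than by valid logic. Compared with humans, models show a sharper drop in ability from solving to evaluating, which suggests AI may become a confident engine for producing plausible-looking arguments rather than a reasoning engine that truly understands its own output.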
This paper shows a strange weakness in AI reasoning: models can solve math problems, yet fail to judge whether someone else's reasoning is valid.
The unsettling part is not that frontier models make arithmetic mistakes.
It is that they can reach the right answer, see the right answer in someone else's solution, and then forgive broken logic that should have been easy to catch.
The authors call this the production-evaluation gap: the gap between generating a solution and evaluating whether a given solution actually earns its conclusion.
Their Valid-Answer-Invalid-Reasoning (VAIR) benchmark isolates the trap cleanly.
The final answer is correct, but the reasoning is damaged by missing steps, shuffled steps, missing premises, or circular explanation.
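
To make the four kinds of damage concrete, here is a minimal Python sketch of how a valid solution could be turned into VAIR-style items. It is an illustration only, not the authors' construction pipeline: the `Solution` class, the function names, and the toy algebra problem are all hypothetical, and the paper may build its items quite differently.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Solution:
    """Hypothetical container for one worked solution."""
    premises: List[str]
    steps: List[str]
    answer: str


def drop_step(sol: Solution, index: int = 0) -> Solution:
    # Missing step: remove one reasoning step, keep the answer untouched.
    steps = sol.steps[:index] + sol.steps[index + 1:]
    return Solution(list(sol.premises), steps, sol.answer)


def shuffle_steps(sol: Solution, seed: int = 0) -> Solution:
    # Shuffled steps: reorder until the order actually differs.
    if len(set(sol.steps)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    rng = random.Random(seed)
    steps = list(sol.steps)
    while steps == sol.steps:
        rng.shuffle(steps)
    return Solution(list(sol.premises), steps, sol.answer)


def drop_premise(sol: Solution, index: int = 0) -> Solution:
    # Missing premise: remove a given fact the steps rely on.
    premises = sol.premises[:index] + sol.premises[index + 1:]
    return Solution(premises, list(sol.steps), sol.answer)


def make_circular(sol: Solution) -> Solution:
    # Circular explanation: the conclusion is used to justify itself.
    step = f"The answer is {sol.answer} because the result equals {sol.answer}."
    return Solution(list(sol.premises), [step], sol.answer)


def perturb_all(sol: Solution) -> Dict[str, Solution]:
    """Return one damaged variant per flaw type; every answer stays correct."""
    return {
        "missing_step": drop_step(sol),
        "shuffled_steps": shuffle_steps(sol),
        "missing_premise": drop_premise(sol),
        "circular": make_circular(sol),
    }


if __name__ == "__main__":
    valid = Solution(
        premises=["x + 2 = 5"],
        steps=["Subtract 2 from both sides.", "So x = 3."],
        answer="3",
    )
    for flaw, item in perturb_all(valid).items():
        print(flaw, item)
```

Every variant keeps `answer` unchanged, which is the point of the trap: a grader that checks only the final answer cannot tell any of these items from the valid original.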