# Revealing the Hidden Critique Mechanism in Large Reasoning Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-22 08:00
- AIHOT score: 63
- AIHOT link: https://aihot.virxact.com/items/cmpmd6ghh0nolsl01qih4s9lx
- Original link: https://arxiv.org/abs/2603.16331

## AI Summary

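This study examines how large reasoning models (LRMs) recover from errors. By inserting arithmetic errors into intermediate reasoning steps, the authors observe a key phenomenon: even when the error persists through the entire chain-of-thought (CoT) without ever being corrected in words, the model can still output the correct answer once thinking ends. This suggests that the model has an internal "hidden critique ability" that detects errors and triggers correction. Using feature space analysis, the researchers identify an interpretable critique vector that represents this behavior. Experiments across model scales and families show that steering latent representations with this vector improves the model's error detection and strengthens test-time scaling performance, with no additional training cost.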

## Main Text

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.
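
The abstract does not say how the critique vector is extracted or applied, so the following is only a minimal sketch of one common way to steer with a direction in activation space, not the authors' method. It assumes the direction is a unit-norm difference of mean activations between error-containing and clean reasoning traces, and that steering adds a scaled copy of it to a hidden state. The function names, the toy activations, and the strength `alpha` are all hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def critique_direction(error_acts, clean_acts):
    """Unit-norm difference of means between two activation sets.

    Assumption: this stands in for the paper's critique vector; the
    abstract does not describe the actual extraction procedure.
    """
    mu_err = mean_vector(error_acts)
    mu_clean = mean_vector(clean_acts)
    diff = [e - c for e, c in zip(mu_err, mu_clean)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy 3-dimensional activations (hypothetical values, not real model data).
error_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
clean_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = critique_direction(error_acts, clean_acts)
hidden = [0.5, -1.0, 2.0]
steered = steer(hidden, direction, alpha=4.0)

print(direction)  # [1.0, 0.0, 0.0]
print(steered)    # [4.5, -1.0, 2.0]
```

Because the direction has unit norm, steering raises the hidden state's projection onto it by exactly `alpha`, which makes the strength easy to reason about. In a real model the same addition would be applied to a chosen layer's activations at inference time, which is consistent with the abstract's claim of no extra training cost.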
