Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
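Source: arxiv.org
By training correctness classifiers that compare a model's own hidden states against representations from external models, the study finds that large language models hold domain-specific privileged knowledge on factual knowledge tasks but not on math reasoning. Standard evaluation shows self-probes and peer probes performing comparably, but on the subset where model predictions disagree, self-representations consistently outperform peer representations on factual tasks. Layer-wise analysis shows that the privileged advantage for factual knowledge emerges gradually from early to middle layers, consistent with model-specific memory retrieval, whereas math reasoning shows no such advantage at any depth.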
Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
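The self-probe versus peer-probe comparison described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the classifier, so a logistic-regression probe trained by plain gradient descent stands in for it, and "disagreement subset" is taken to mean questions where exactly one of the two models answered correctly. All function and parameter names (`disagreement_mask`, `fit_linear_probe`, `probe_accuracy`, `compare_probes`, `train_frac`) are hypothetical.

```python
import numpy as np


def disagreement_mask(correct_self, correct_peer):
    """True where exactly one of the two models answered correctly (assumed definition)."""
    return np.asarray(correct_self, dtype=bool) != np.asarray(correct_peer, dtype=bool)


def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe fitted by gradient descent (a stand-in classifier)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples whose thresholded probe output matches the label."""
    pred = (np.asarray(X, dtype=float) @ w + b) > 0
    return float((pred == np.asarray(y, dtype=bool)).mean())


def compare_probes(h_self, h_peer, correct_self, correct_peer, train_frac=0.7):
    """Predict the target model's correctness from its own states vs. a peer's states.

    h_self, h_peer: (n_questions, dim) question representations (hypothetical inputs).
    Returns accuracy on the full held-out split and on its disagreement subset.
    """
    y = np.asarray(correct_self, dtype=bool)
    split = int(train_frac * len(y))
    mask = disagreement_mask(correct_self, correct_peer)[split:]
    results = {}
    for name, H in (("self", h_self), ("peer", h_peer)):
        H = np.asarray(H, dtype=float)
        w, b = fit_linear_probe(H[:split], y[:split])
        results[name + "_all"] = probe_accuracy(w, b, H[split:], y[split:])
        # Guard against an empty subset when the models never disagree.
        results[name + "_disagree"] = (
            probe_accuracy(w, b, H[split:][mask], y[split:][mask])
            if mask.any()
            else float("nan")
        )
    return results
```

When the two models' correctness labels mostly agree, a peer probe can score well on the full split simply by tracking the peer's own correctness. Restricting evaluation to the disagreement subset removes that shortcut, which is the motivation the abstract gives for evaluating there. The same comparison could be repeated per layer by passing representations from each depth in turn.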