# Deeper Is Not Always Better: Mitigating the Alignment Tax via Confident Decoding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-20 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmqqb5ey4088lslp5re24zwkq
- Original link: https://arxiv.org/abs/2606.21906

## AI Summary

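Autoregressive generation in large language models conventionally decodes from the final layer, but the study finds that the final layer can perturb predictions toward generic or alignment-preferred tokens, incurring an alignment tax. Confident Decoding is a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search, and it formulates layer selection theoretically as an optimal stopping problem. On dense and MoE large language models, the method achieves consistent improvements on reasoning benchmarks such as GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and a latency increase of less than 2%. The results suggest that dynamically bypassing final-layer perturbations can unlock stronger reasoning capabilities.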

## Main Text

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest that dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
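To make the idea of an entropy-guided conservative backward search more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact search rule, so the stopping criterion, the `window` and `margin` parameters, and the function names `select_layer` and `decode_token` are all illustrative assumptions. It also assumes that per-layer next-token logits are already available, for example from projecting each layer's hidden state through the output head.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer(
    layer_logits: Sequence[Sequence[float]],
    window: int = 3,      # assumed: how many layers below the final one to inspect
    margin: float = 0.1,  # assumed: minimum entropy drop needed to step back
) -> int:
    """Pick a near-final layer by walking backward from the last layer.

    Hypothetical rule: move one layer earlier only if that layer is
    clearly more confident (entropy lower by at least `margin`), and
    stop at the first layer that is not. The final layer is the default.
    """
    last = len(layer_logits) - 1
    best = last
    best_h = entropy(softmax(layer_logits[last]))
    lowest = max(0, last - window)
    for layer in range(last - 1, lowest - 1, -1):
        h = entropy(softmax(layer_logits[layer]))
        if h < best_h - margin:
            best, best_h = layer, h
        else:
            break  # conservative: stop at the first non-improving layer
    return best


def decode_token(
    layer_logits: Sequence[Sequence[float]],
    window: int = 3,
    margin: float = 0.1,
) -> int:
    """Greedy next-token id taken from the selected layer's logits."""
    logits = layer_logits[select_layer(layer_logits, window, margin)]
    return max(range(len(logits)), key=logits.__getitem__)
```

In a toy case with three layers, where the middle layer is sharply peaked on token 0 and the final layer is flatter and leans toward token 1, this rule steps back one layer and emits token 0. When the final layer is already the most confident, the search stops immediately and decoding proceeds from the final layer as usual.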
