后训练中输出多样性在何处崩溃?
阅读原文· arxiv.org研究团队通过Olmo 3的三个后训练谱系(Think、Instruct、RL-Zero)追踪输出多样性变化。发现多样性崩溃与数据组成密切相关:Think在监督微调阶段损失大部分语义多样性,DPO对Instruct影响更大。抑制Think模型的思维链推理虽降低准确率但不改变多样性,证明崩溃由训练数据嵌入权重导致。在可验证任务中,Think虽总体崩溃更多但保留更多正确答案多样性。研究表明多样性崩溃由训练数据组成决定,无法仅靠推理时间解决。
Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.