后训练中输出多样性在何处崩溃？

2026-04-17 08:00·77天前

AI 摘要

研究团队通过Olmo 3的三个后训练谱系（Think、Instruct、RL-Zero）追踪输出多样性变化。发现多样性崩溃与数据组成密切相关：Think在监督微调阶段损失大部分语义多样性，DPO对Instruct影响更大。抑制Think模型的思维链推理虽降低准确率但不改变多样性，证明崩溃由训练数据嵌入权重导致。在可验证任务中，Think虽总体崩溃更多但保留更多正确答案多样性。研究表明多样性崩溃由训练数据组成决定，无法仅靠推理时间解决。

原文 · 未翻译

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.

HuggingFace Daily Papers（社区热门论文）

导出 Markdown

后训练中输出多样性在何处崩溃？

2026-04-17 08:00·77天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译