弱监督下大语言模型何时能学会推理?
阅读原文· arxiv.org本研究探索大语言模型在弱监督下通过RLVR学习推理的机制。在稀缺数据、噪声奖励和自监督代理奖励三种场景中,训练奖励饱和动态决定泛化能力:延长预饱和阶段促进泛化,快速饱和导致记忆。推理忠实度(中间步骤对答案的逻辑支持程度)是预测模型表现的关键属性。研究表明,显式推理轨迹上的监督微调对弱监督泛化至关重要,结合领域数据持续预训练,可使Llama3.2-3B-Base在原本失败的三种场景中均实现泛化。
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.