DiPO: Disentangled Perplexity Policy Optimization for Fine-Grained Exploration-Exploitation Trade-Off
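Source: arxiv.org
To address the exploration-exploitation dilemma posed by extremely hard and extremely easy samples in RLVR training, the research team proposes DiPO. A perplexity-space disentangling strategy divides samples into a high-perplexity exploration subspace and a low-perplexity exploitation subspace, pinpointing the samples that need a fine-grained trade-off. A bidirectional reward allocation mechanism then provides stable, perplexity-guided policy optimization. Experiments show that the method performs well on mathematical reasoning and function calling tasks and effectively strengthens the reasoning ability of large language models.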
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby identifying the samples that require a fine-grained exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that has minimal impact on the verification rewards and implements perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. The experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.
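The abstract gives no formulas, so the following is only a rough sketch of how a perplexity-based split and a small reward adjustment might look. The perplexity definition is the standard one (the exponential of the mean negative token log-probability). Everything else is an assumption made for illustration and is not taken from the paper: the function names, the median threshold, the value of eps, and above all the rule deciding which samples receive a bonus or a penalty.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Standard perplexity of one response: exp(mean negative log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(ppls, threshold=None):
    """Split sample indices into (explore, exploit) lists.

    ASSUMPTION: the threshold defaults to the median perplexity of the batch;
    the abstract does not say how the two subspaces are delimited.
    """
    if threshold is None:
        threshold = median(ppls)
    explore = [i for i, p in enumerate(ppls) if p > threshold]   # high perplexity
    exploit = [i for i, p in enumerate(ppls) if p <= threshold]  # low perplexity
    return explore, exploit


def shape_rewards(rewards, ppls, eps=0.1, threshold=None):
    """Apply a small perplexity-dependent adjustment to binary verifier rewards.

    HYPOTHETICAL RULE (not the paper's): a correct high-perplexity response
    gets +eps, an incorrect low-perplexity response gets -eps, and all other
    rewards are unchanged. With eps < 0.5 every correct response still scores
    above every incorrect one, so the verifier signal stays dominant.
    """
    explore, _ = split_by_perplexity(ppls, threshold)
    explore_set = set(explore)
    shaped = []
    for i, r in enumerate(rewards):
        if i in explore_set and r > 0:
            shaped.append(r + eps)   # successful exploration
        elif i not in explore_set and r <= 0:
            shaped.append(r - eps)   # confident but wrong
        else:
            shaped.append(float(r))
    return shaped
```

In an RLVR loop, the shaped rewards would stand in for the raw verifier rewards before advantages are computed. Keeping eps well below the gap between correct and incorrect rewards is one way to read the abstract's phrase "minimal impact on the verification rewards".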