# Small Models Are Natural Explorers of Policy-Level Diversity in GRPO: The S2L-PO Framework

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 08:00
- AIHOT score: 41
- AIHOT link: https://aihot.virxact.com/items/cmqer3py204vhslunhfwlgm82
- Original link: https://arxiv.org/abs/2605.30789

## AI Summary

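To address insufficient rollout diversity in GRPO training, the study finds that smaller models within the same model family naturally exhibit higher policy-level diversity (their pass@k surpasses that of larger models as the sample count grows), and that this diversity is temporally correlated, logically consistent, and provides structured exploration signals. It proposes the S2L-PO framework, which uses fixed small models as explorers to train a larger model, together with a progressive annealing strategy that transitions from offline small-model rollouts to the large model's own sampling, avoiding performance drops and speeding up convergence. S2L-PO improves accuracy on multiple mathematical reasoning benchmarks, for example +8.8% on AIME 24 when a 1.7B explorer guides an 8B model, while reducing rollout compute.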
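The pass@k comparison mentioned above can be made concrete with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that are correct. The sketch below is illustrative only: the per-problem correct counts are made-up numbers chosen to show how a model with lower pass@1 can overtake at larger k, not results from the paper, and the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, each with n samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL correct counts out of n = 16 samples on two problems.
# "large": always solves problem A, never solves problem B.
# "small": solves both problems, but only occasionally.
n_samples = 16
large_counts = [16, 0]
small_counts = [6, 2]

for k in (1, 8):
    print(
        f"k={k}: large={mean_pass_at_k(large_counts, n_samples, k):.3f} "
        f"small={mean_pass_at_k(small_counts, n_samples, k):.3f}"
    )
```

With these toy counts the larger model leads at k = 1 while the smaller one leads at k = 8, which is the shape of the crossover the summary describes.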

## Main Text

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
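The progressive annealing idea can be sketched as a schedule that decides, at each training step, how many rollouts in a GRPO group come from the fixed small explorer's offline pool and how many are sampled from the large learner. The abstract does not specify the schedule's form, so the linear decay below is an assumption, as are all function and parameter names. The advantage computation is the usual GRPO group normalization (reward minus group mean, divided by group standard deviation). Any off-policy correction for the explorer's rollouts is left out because the abstract does not describe one.

```python
import random
from collections.abc import Callable
from statistics import mean, pstdev


def small_fraction(step: int, anneal_steps: int) -> float:
    """Share of explorer rollouts at a step; ASSUMED linear decay from 1.0 to 0.0."""
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def build_group(
    offline_small: list[str],
    sample_large: Callable[[], str],
    group_size: int,
    frac_small: float,
    rng: random.Random,
) -> list[str]:
    """Mix offline small-model rollouts with fresh learner rollouts for one prompt."""
    n_small = min(round(frac_small * group_size), len(offline_small))
    group = rng.sample(offline_small, n_small)  # drawn without replacement
    group += [sample_large() for _ in range(group_size - n_small)]
    return group


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [f"small-rollout-{i}" for i in range(8)]  # placeholder offline rollouts
    for step in (0, 50, 100):
        frac = small_fraction(step, anneal_steps=100)
        group = build_group(pool, lambda: "large-rollout", 8, frac, rng)
        n_small = sum(g.startswith("small") for g in group)
        print(f"step={step} frac_small={frac:.2f} small_in_group={n_small}")
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Early steps draw the whole group from the explorer's pool, and once `step` reaches `anneal_steps` the learner generates all of its own rollouts, which mirrors the offline-to-on-policy transition described above.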