Sustained Balance: Information Bottleneck-Based Tree Policy Optimization

2026-05-27 08:00 · 37 days ago

AI Summary

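This study addresses the imbalance between exploration and exploitation in online reinforcement learning for large language models. It proposes a new metric, IB-Score, grounded in Information Bottleneck theory, which quantifies the trade-off between step-level reasoning diversity and information about the correct answer. The analysis shows that mainstream methods such as GRPO struggle to maintain this balance. In response, the paper proposes the IB-TPO framework, which uses IB-Score as the optimization objective and adopts an Information Bottleneck-guided tree sampling strategy that yields 50% more trajectories under the same token budget. Experiments show that the method significantly outperforms the GRPO baseline on standard benchmarks, with gains of 2.9% to 3.6%. The code is open source: https://github.com/alibaba/EfficientRL.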

Original abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain this balance during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
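
To make the token-budget claim concrete, the following is a minimal sketch of why sharing prefixes in a sampling tree can produce more complete trajectories for the same number of generated tokens. It is not the paper's IB-guided sampler: the function `tree_cost`, the fixed step length, and the branching patterns are all hypothetical, and the numbers were picked so that the ratio comes out to 1.5x purely for illustration.

```python
def tree_cost(branching, tokens_per_step):
    """Return (num_trajectories, total_generated_tokens) for a sampling tree.

    Assumption (illustrative only): branching[d] is the number of children
    every node spawns at depth d, and each node is one reasoning step that
    costs exactly tokens_per_step generated tokens.
    """
    nodes_at_depth = 1   # start from the prompt (root), which costs nothing here
    total_nodes = 0
    for b in branching:
        nodes_at_depth *= b            # nodes created at this depth
        total_nodes += nodes_at_depth  # each node is generated exactly once
    # Leaves of the tree are the complete trajectories.
    return nodes_at_depth, total_nodes * tokens_per_step


# Independent sampling: 8 rollouts that share only the prompt.
independent = tree_cost([8, 1, 1, 1], tokens_per_step=100)

# Tree sampling: 4 first steps, then a 3-way branch at the third step,
# so later trajectories reuse the tokens of their shared prefix.
tree = tree_cost([4, 1, 3, 1], tokens_per_step=100)

print(independent)  # (8, 3200)
print(tree)         # (12, 3200)
```

In this toy setting both schemes generate 3200 tokens, but the tree yields 12 trajectories instead of 8. Where to branch is the part the abstract attributes to IB guidance, and this sketch does not model it.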

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
