# DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-Reward Reinforcement Learning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-25 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpm0ai190kf7sl01e65fmhtq
- Original paper: https://arxiv.org/abs/2605.25604

## AI Summary

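Conventional scalarization methods for multi-reward reinforcement learning, such as Reward Combination and Advantage Combination, either destabilize training or depend on static hyperparameters. To address this, the paper proposes Dynamic Variance-adaptive Advantage Optimization (DVAO). The method dynamically adjusts the combination weights according to the empirical reward variance of each objective in each sampling round, up-weighting objectives with a strong learning signal and suppressing noise. The paper proves that DVAO keeps advantage magnitudes bounded, which stabilizes training, and introduces a self-adaptive cross-objective regularization mechanism. On mathematical reasoning and tool-use benchmarks with Qwen3 and Qwen2.5 models, DVAO significantly outperforms baseline methods, with better results on both the multi-objective Pareto frontier and training stability.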

## Full Text

Reinforcement learning (RL) has become a standard paradigm for aligning large language models (LLMs) with human intent and task requirements. While Group Relative Policy Optimization (GRPO) offers an efficient, value-model-free alternative to Proximal Policy Optimization (PPO), adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, have significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes, which leads to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
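
The three combination schemes named above can be contrasted in a small sketch. The abstract does not give DVAO's actual weighting formula, so the rule used here (each objective's weight is its share of the total within-group reward variance) is only an assumption chosen to illustrate the idea, and the cross-objective regularization term is omitted entirely. All function names, the `EPS` constant, and the example reward values are hypothetical. Normalization follows the usual Group Relative Policy Optimization style of subtracting the group mean and dividing by the group standard deviation.

```python
from statistics import mean, pstdev, pvariance

EPS = 1e-8  # guards against division by zero when all rewards in a group are equal


def group_normalize(values):
    """Group-relative advantage: subtract the group mean, divide by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def reward_combination(rewards, weights):
    """Mix the raw rewards per rollout first, then normalize the mixed reward once."""
    n = len(next(iter(rewards.values())))
    combined = [sum(weights[k] * rewards[k][i] for k in rewards) for i in range(n)]
    return group_normalize(combined)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix the advantages with fixed weights."""
    per_objective = {k: group_normalize(r) for k, r in rewards.items()}
    n = len(next(iter(rewards.values())))
    return [sum(weights[k] * per_objective[k][i] for k in rewards) for i in range(n)]


def variance_adaptive_weights(rewards):
    """ASSUMED rule, not taken from the paper: weight = share of total group variance."""
    variances = {k: pvariance(r) for k, r in rewards.items()}
    total = sum(variances.values())
    if total <= EPS:
        # No objective separates the rollouts; fall back to uniform weights.
        return {k: 1.0 / len(rewards) for k in rewards}
    return {k: v / total for k, v in variances.items()}


def variance_adaptive_combination(rewards):
    """Advantage combination whose weights are recomputed for every rollout group."""
    return advantage_combination(rewards, variance_adaptive_weights(rewards))


if __name__ == "__main__":
    # One rollout group of four responses scored on two hypothetical objectives.
    group = {
        "correctness": [1.0, 0.0, 1.0, 0.0],  # separates the rollouts
        "format": [1.0, 1.0, 1.0, 1.0],       # identical for all rollouts
    }
    static = {"correctness": 0.5, "format": 0.5}
    print("reward combination:   ", reward_combination(group, static))
    print("advantage combination:", advantage_combination(group, static))
    print("adaptive weights:     ", variance_adaptive_weights(group))
    print("adaptive combination: ", variance_adaptive_combination(group))
```

In this sketch the adaptive weights are non-negative and sum to 1, so the combined advantage is a convex mix of unit-variance terms and its mean square within the group cannot exceed 1. Whether that corresponds to the bound proved in the paper is not stated in the abstract.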
