ProRL: A Reinforcement Learning Framework for Proactive Recommendation Based on Rectified Policy Gradient Estimation
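To address the gradient estimation deficiencies that arise when policy gradient methods are applied naively to proactive recommender systems, the study proposes the ProRL reinforcement learning framework. The framework identifies two problems: a length-dependent bias that appears when path-level rewards decompose into step-level rewards, and high variance caused by ignoring that decomposition structure. ProRL introduces two mechanisms. Stepwise Reward Centering subtracts the expected reward to remove the length bias, and Position-Specific Advantage Estimation uses the reward decomposition structure to compute step-level baselines that reduce variance. Experiments show that ProRL significantly outperforms existing state-of-the-art methods on three real-world datasets.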
Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
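The two mechanisms described above can be illustrated with a small sketch. The abstract does not give the exact estimators, so everything here is an assumption made for illustration: the expected step reward is approximated by the batch mean, the step-dependent baseline is the mean centered reward-to-go at each position, and the function names (`center_rewards`, `rewards_to_go`, `position_advantages`) are hypothetical and are not taken from the ProRL repository.

```python
from typing import List

Path = List[float]  # step-level rewards of one recommendation path


def center_rewards(paths: List[Path]) -> List[Path]:
    """Subtract an estimate of the expected step reward from every step.

    Assumption: the expectation is approximated by the mean over all
    steps in the batch. The paper may use a different estimator.
    """
    all_rewards = [r for path in paths for r in path]
    mean_reward = sum(all_rewards) / len(all_rewards)
    return [[r - mean_reward for r in path] for path in paths]


def rewards_to_go(path: Path) -> Path:
    """Return, for each step, the sum of rewards from that step onward."""
    out: Path = []
    running = 0.0
    for r in reversed(path):
        running += r
        out.append(running)
    return out[::-1]


def position_advantages(paths: List[Path]) -> List[Path]:
    """Centered reward-to-go minus a per-position baseline.

    Assumption: the baseline at position t is the mean centered
    reward-to-go over all paths that are long enough to reach t.
    """
    returns = [rewards_to_go(p) for p in center_rewards(paths)]
    max_len = max(len(g) for g in returns)
    baselines = []
    for t in range(max_len):
        values = [g[t] for g in returns if len(g) > t]
        baselines.append(sum(values) / len(values))
    return [[g[t] - baselines[t] for t in range(len(g))] for g in returns]
```

Under these assumptions, a batch such as `[[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]`, where every step earns the same positive reward, produces all-zero advantages, so the longer path receives no extra weight merely for being longer. With uncentered path-level rewards, the four-step path would score 4.0 against 2.0 for the two-step path, which is the length-dependent bias the abstract describes.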