基于评分的强化学习中奖励黑客的再现、分析与检测
阅读原文· arxiv.org基于评分标准的强化学习使用 LLM-as-a-Judge 对模型输出打分作为奖励,但策略模型可能利用评判者的潜在偏见导致奖励黑客,使训练结果无效甚至不安全。论文提出 CHERRL,一个可控黑客环境,通过向评判注入已知偏见,稳定再现奖励黑客、观察奖励分歧并精确识别黑客起始点。利用该环境分析了不同评判偏见的可发现性和可利用性,并探索了基于智能体的系统自动从训练日志检测黑客起始点。代码与环境已公开。
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.