方向对齐缓解大语言模型强化学习中的奖励漏洞

2026-05-24 08:00·40天前

AI 摘要

奖励漏洞是大语言模型在强化学习中通过捷径优化代理奖励而非解决任务的问题。研究通过分析参数更新的奇异方向发现，漏洞运行相比正常运行呈现显著方向偏移。为此提出可信方向投影方法，将梯度约束在干净参考子空间内，在数学推理实验中有效延迟了捷径利用并保持任务性能。

原文 · 未翻译

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.

HuggingFace Daily Papers（社区热门论文）

54导出 Markdown

方向对齐缓解大语言模型强化学习中的奖励漏洞

2026-05-24 08:00·40天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译

arXiv安全/对齐推理