Qwen 发布关于强化学习编码智能体的新工作,指出 LLM 的奖励黑客问题。他们系统研究了编码智能体中的各种奖励信号——测试通过率、LLM 评判器和执行轨迹,发现每种信号都存在一个“地平线”:超出该界限后,信号不再跟踪真实正确性,而是被奖励黑客利用。论文认为长周期编码的奖励设计本质上是地平线问题,指标的选择不如它能持续跟踪正确性的时长重要。
Qwen publishes new work on RL coding agents.
(bookmark it)
The idea is to continually build a verification system that co-evolves with AI agents.
LLMs suffer from all sorts of reward hacking issues. This work studies coding-agent reward signals, test pass rates, LLM judges, and execution traces, and shows each one has a horizon beyond which it stops tracking real correctness and starts getting hacked.
They report that reward design for long-horizon coding is really a horizon problem. The metric you pick matters less than how long it keeps tracking correctness, and the paper finds where each signal crosses that line.
Paper: https://arxiv.org/abs/2606.26300
Learn to build effective AI agents in our academy: https://academy.dair.ai/