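The Baidu paper proposes reformulating open-ended tasks (such as writing and subjective answers) into a verifiable multiple-choice form, replacing direct scoring with pairwise comparison to give RL a clear reward signal. Across seven benchmarks, the 14B model averages 3.29 points above the RLHF baseline. The key innovation is a change in the form of the training task: the model learns to tell better responses from worse ones through comparative verification, rather than simply absorbing preference pairs. The study also finds that the RLHF objective has to be mixed in to keep output length from collapsing. The method suggests that replacing fuzzy scoring with structured comparison may be a general alignment strategy for improving reasoning ability.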
This Baidu paper found a way to use the clean, reliable rewards of RL on tasks like writing and subjective answers, where there is usually no single "correct" output.
Instead of asking "is this response correct?", they ask "which of these two responses is better?", and on their benchmarks that simple reformulation appears to improve open-ended reasoning more than standard reward-model training does.
In other words, it turns open-ended writing into verifiable choices, and RL starts working there too.
Across seven open-ended benchmarks, the method beats a matched RLHF baseline by an average of 3.29 points on a 14B reasoning model.
The clever part is not a better reward model.
It is a change in what the model is asked to do during training.
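To make the reformulation concrete, here is a minimal sketch of how a preference pair could be turned into a two-option question with a checkable answer. This is not the paper's code: `PreferencePair`, `build_choice_task`, `verifiable_reward`, and the prompt wording are all hypothetical names and choices made for illustration, and the paper's actual prompt format and reward details may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container: one prompt with a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


def build_choice_task(pair: PreferencePair, rng: random.Random) -> tuple[str, str]:
    """Turn a preference pair into an A/B question whose answer is known.

    The position of the preferred response is randomized so the label
    cannot be guessed from position alone.
    """
    chosen_first = rng.random() < 0.5
    if chosen_first:
        first, second = pair.chosen, pair.rejected
    else:
        first, second = pair.rejected, pair.chosen
    question = (
        f"Prompt:\n{pair.prompt}\n\n"
        f"Response A:\n{first}\n\n"
        f"Response B:\n{second}\n\n"
        "Which response is better? Answer with A or B."
    )
    gold = "A" if chosen_first else "B"
    return question, gold


def verifiable_reward(model_answer: str, gold: str) -> float:
    """Binary reward: 1.0 if the model picked the preferred response, else 0.0."""
    return 1.0 if model_answer.strip().upper() == gold else 0.0
```

The point of the sketch is that the reward is now an exact-match check against a known label, the same kind of signal RL gets on math or code, rather than a scalar score from a learned reward model.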