# Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-04-13 10:09
- AIHOT link: https://aihot.virxact.com/items/cmnwkxvlw0124sl6x1x1n247y
- Original link: https://x.com/rohanpaul_ai/status/2043511974416588985

## AI Summary

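The Baidu paper proposes reformulating open-ended tasks (such as writing and subjective answers) into a verifiable multiple-choice format, replacing direct scoring with pairwise comparison to give RL a clear reward signal. Across 7 benchmarks, the 14B model scores an average of 3.29 points higher than the RLHF baseline. The key innovation is a change in the form of the training task: the model learns to tell better from worse through contrastive verification rather than simply absorbing preference pairs. The study also finds that an RLHF objective must be mixed in to prevent output length from collapsing. The method suggests that replacing fuzzy scoring with structured comparison may be a general alignment strategy for improving reasoning ability.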

## Body

This Baidu paper found a way to use the clean, reliable rewards of RL on tasks like writing and subjective answers, where there is usually no single "correct" output.

Instead of asking "is this response correct?", they ask "which of these two responses is better?", and that simple reformulation appears to improve open-ended reasoning more than standard reward-model training does on their benchmarks.

In other words, it turns open-ended writing into verifiable choices, and RL starts working there too.

Across seven open-ended benchmarks, the method beats a matched RLHF baseline by an average of 3.29 points on a 14B reasoning model.

The clever part is not a better reward model.

It is a change in what the model is asked to do during training.

Instead of grading a poem or subjective answer directly, the system sees two candidate responses, one preferred and one rejected, and learns to identify which is better.

Multiple choice creates a clean binary signal, so the model can be trained with the same kind of verifiable reward that made RL powerful in math and code, without pretending open-ended tasks have one canonical answer.
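To make the reformulation concrete, here is a minimal sketch of how a preference pair could be turned into a two-option question with a checkable reward. The prompt template, the `Answer: A/B` output convention, and the names `ChoiceItem`, `build_choice_item`, and `verifiable_reward` are all assumptions for illustration, not details taken from the paper.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class ChoiceItem:
    prompt: str  # text shown to the policy model
    gold: str    # "A" or "B": the slot holding the preferred response


def build_choice_item(query: str, chosen: str, rejected: str,
                      rng: random.Random) -> ChoiceItem:
    """Turn one preference pair into a two-option question.

    Slots are shuffled so the model cannot score by always answering "A".
    The template wording is hypothetical, not the paper's.
    """
    chosen_first = rng.random() < 0.5
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    prompt = (
        f"Question:\n{query}\n\n"
        f"Response A:\n{a}\n\n"
        f"Response B:\n{b}\n\n"
        "Which response is better? Reason step by step, then finish with "
        "'Answer: A' or 'Answer: B'."
    )
    return ChoiceItem(prompt=prompt, gold="A" if chosen_first else "B")


# Assumed output convention: the last "Answer: X" in the completion counts.
ANSWER_RE = re.compile(r"Answer:\s*([AB])\b")


def verifiable_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the final stated answer matches the gold slot, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # unparseable output earns no reward
    return 1.0 if matches[-1] == gold else 0.0
```

The reward here is an exact-match check against a known label, which is what makes it "verifiable" in the same sense as a math answer or a unit test, even though the underlying task has no single correct output.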

The gain is probably not just better taste imitation. The paper's DPO ablation underperforms badly, which suggests the benefit comes from learning a contrastive verification habit, not merely absorbing preference pairs.

The authors also catch an important failure mode: train only on these choice tasks and responses get unnaturally short.

So they mix in a small RLHF objective to keep output length from collapsing, and the resulting model appears more useful rather than merely more terse.
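The thread does not say how the two objectives are combined. One plausible reading is per-batch mixing, where most training examples are choice tasks and a small share are ordinary open-ended prompts scored by a reward model. The sketch below assumes that reading; `mix_batch` and `rlhf_fraction` are hypothetical names, and the paper's actual scheme may differ.

```python
import random
from typing import List, Tuple


def mix_batch(choice_items: List[str], rlhf_prompts: List[str],
              batch_size: int, rlhf_fraction: float,
              rng: random.Random) -> List[Tuple[str, str]]:
    """Sample one training batch that is mostly choice tasks.

    Each element is tagged so the trainer can route it to the right reward:
    "choice" -> exact-match verifiable reward,
    "rlhf"   -> reward-model score on a free-form response.
    Assumption: mixing happens at the batch level with a fixed fraction.
    """
    n_rlhf = round(batch_size * rlhf_fraction)
    n_choice = batch_size - n_rlhf
    batch = [("choice", p) for p in rng.sample(choice_items, n_choice)]
    batch += [("rlhf", p) for p in rng.sample(rlhf_prompts, n_rlhf)]
    rng.shuffle(batch)  # avoid a fixed ordering of the two task types
    return batch
```

For example, `mix_batch(choice_items, rlhf_prompts, batch_size=8, rlhf_fraction=0.25, rng=random.Random(0))` yields six choice tasks and two free-form prompts. Keeping some free-form generation in the loop is what gives the policy a reason to keep producing full-length answers.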

The strongest claim here is not that open-ended evaluation is solved.

It is that reasoning can be improved when you replace fuzzy scoring with structured comparison, which may be a more general lesson for alignment than this paper admits.

----

Paper Link - arxiv.org/abs/2511.02463

Paper Title: "Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation"