# Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual QA

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpqjlfr105k0slno3cobo2a8
- Original link: https://arxiv.org/abs/2605.29648

## AI Summary

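To address the reward-design problem that arises when reinforcement learning is used to improve factual QA accuracy, this paper proposes CorVer. It replaces expensive and unreliable neural verifiers (such as NLI models or LLM judges) with a lightweight corpus signal based on Wikipedia co-occurrence statistics. CorVer assigns a credit value to each sentence and maps it to token-level advantages through a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 test combinations covering six instruction-tuned models and five QA benchmarks, CorVer outperforms the raw baseline in every combination, with an average gain of +4.1 percentage points on TriviaQA. Under feasible configurations, it beats the neural-verifier baselines in 18 of 20 combinations while training 4.8 to 8.4 times faster.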
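
To make the idea of a corpus co-occurrence signal concrete, here is a minimal sketch of how a sentence could be scored from entity co-occurrence counts. The summary states only that the signal comes from Wikipedia co-occurrence statistics; the positive-PMI formula, the toy corpus, and the names `build_counts` and `cooccurrence_credit` are assumptions made for illustration, not the paper's actual definition.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for a Wikipedia index: each "document" is a set of entities.
TOY_CORPUS = [
    {"Paris", "France", "Seine"},
    {"Paris", "France", "Eiffel Tower"},
    {"Berlin", "Germany"},
    {"Paris", "Texas"},
]


def build_counts(corpus):
    """Count how many documents mention each entity and each entity pair."""
    single = Counter()
    pair = Counter()
    for doc in corpus:
        single.update(doc)
        pair.update(frozenset(p) for p in combinations(sorted(doc), 2))
    return single, pair


def cooccurrence_credit(entities, single, pair, n_docs):
    """Mean positive PMI over the entity pairs of one sentence (assumed scoring rule).

    Returns 0.0 when the sentence has fewer than two distinct entities.
    Pairs that never co-occur in the corpus contribute 0.
    """
    pairs = list(combinations(sorted(set(entities)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        joint = pair[frozenset((a, b))]
        if joint == 0:
            continue
        pmi = math.log(joint * n_docs / (single[a] * single[b]))
        total += max(pmi, 0.0)
    return total / len(pairs)


single, pair = build_counts(TOY_CORPUS)
supported = cooccurrence_credit(["Paris", "France"], single, pair, len(TOY_CORPUS))
unsupported = cooccurrence_credit(["Paris", "Germany"], single, pair, len(TOY_CORPUS))
```

In this toy setup a sentence linking Paris and France receives a positive credit, while one linking Paris and Germany receives zero. In a real pipeline the entity list would come from the extractor model and the counts from a precomputed index rather than an in-memory list.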

## Main Text

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
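
The step from sentence-level credit to token-level advantages can be illustrated with a short sketch. The abstract says only that the mapping uses "a simple alignment"; the span bookkeeping, the mean-centering of credits, the `weight` parameter, and the whitespace tokenizer below are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def sentence_token_spans(sentences, tokenize):
    """Return half-open (start, end) token ranges, one per sentence, in order."""
    spans = []
    start = 0
    for sentence in sentences:
        n = len(tokenize(sentence))
        spans.append((start, start + n))
        start += n
    return spans


def token_advantages(sentence_credits, spans, response_advantage, weight=0.5):
    """Broadcast each sentence's centered credit onto its tokens.

    Assumed rule: token advantage = response-level advantage
    + weight * (sentence credit - mean credit over the response).
    """
    n_tokens = spans[-1][1] if spans else 0
    if not sentence_credits:
        return [response_advantage] * n_tokens
    mean_credit = sum(sentence_credits) / len(sentence_credits)
    advantages = [response_advantage] * n_tokens
    for credit, (lo, hi) in zip(sentence_credits, spans):
        for t in range(lo, hi):
            advantages[t] = response_advantage + weight * (credit - mean_credit)
    return advantages


sentences = ["Paris is in France.", "Paris is in Germany."]
# Whitespace splitting stands in for the policy model's real tokenizer.
spans = sentence_token_spans(sentences, str.split)
advantages = token_advantages([1.0, 0.0], spans, response_advantage=0.2)
```

With these inputs, the four tokens of the supported sentence receive a higher advantage than the four tokens of the unsupported one, which is the within-response distinction that a single response-level reward cannot make. A real implementation would align sentence boundaries with the model's subword tokens, for example through character offsets.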
