Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

2026-06-01 08:00

AI Summary

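Chunk-Level Guided Generation is a training-free inference-time method that uses an off-the-shelf large language model (such as Qwen2.5-32B or Llama-3.1-70B) as a process scorer to guide a small model through mathematical reasoning. At each step the small model generates several fixed-length candidate chunks, and the large model selects among them by likelihood scoring, which steers the reasoning early and keeps errors from propagating. The method has two selection rules, Likelihood-Guided Selection (LGS) and Contrastive-Guided Selection (CGS); CGS subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. Under matched guidance budgets, the method matches or outperforms PRM-guided search, which requires training a reward model, on most benchmarks, and it produces substantially shorter reasoning traces.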

Original abstract

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but it fails when the small model has already committed to incorrect reasoning paths. Process reward model (PRM) guided search avoids this by scoring candidate continuations during generation, but it requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4 to 6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM-guided search.
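
The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `sample_chunks`, `score_large`, and `is_done` are hypothetical stand-ins for the small-model sampler, the large-model likelihood scorer, and a stopping check, and the `alpha` weight and the choice to length-normalize the small-model term in CGS are assumptions that the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

# (chunk text, small-model per-token log-probs for that chunk)
Chunk = Tuple[str, List[float]]


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability: the mean over the chunk's tokens."""
    if not token_logprobs:
        raise ValueError("chunk must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def lgs_score(large_lp: Sequence[float]) -> float:
    """LGS: score a chunk by the large model's length-normalized log-prob."""
    return mean_logprob(large_lp)


def cgs_score(large_lp: Sequence[float], small_lp: Sequence[float],
              alpha: float = 1.0) -> float:
    """CGS: large-model score minus the small-model score.

    `alpha` and the normalization of the small-model term are assumptions.
    """
    return mean_logprob(large_lp) - alpha * mean_logprob(small_lp)


def guided_generate(
    prompt: str,
    sample_chunks: Callable[[str, int], List[Chunk]],  # hypothetical sampler
    score_large: Callable[[str, str], List[float]],    # hypothetical scorer
    is_done: Callable[[str], bool],                    # hypothetical stop test
    k: int = 4,
    max_steps: int = 64,
    rule: str = "cgs",
) -> str:
    """Sample k chunks per step, score them, and commit the best one."""
    if rule not in ("lgs", "cgs"):
        raise ValueError("rule must be 'lgs' or 'cgs'")
    text = prompt
    for _ in range(max_steps):
        candidates = sample_chunks(text, k)
        scores = []
        for chunk, small_lp in candidates:
            # The large model only scores the chunk; it generates no text.
            large_lp = score_large(text, chunk)
            if rule == "lgs":
                scores.append(lgs_score(large_lp))
            else:
                scores.append(cgs_score(large_lp, small_lp))
        best = max(range(len(candidates)), key=scores.__getitem__)
        text += candidates[best][0]  # commit before sampling the next step
        if is_done(text):
            break
    return text
```

Because every candidate chunk has the same token count, the normalization in `mean_logprob` rescales all scores by the same factor and does not change the ranking; it is kept here only to mirror the "length-normalized" wording of the abstract.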

HuggingFace Daily Papers (community trending papers)


Read the original: arxiv.org
