When More Sampling Hurts: The Modal Ceiling and the Correlation Ceiling of Test-Time Scaling

2026-06-27 08:00 · 6 days ago

AI summary

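Reasoning systems answer hard questions by sampling many times (test-time scaling). Coverage rises with the number of samples, but the system must still select a single answer. Selection accuracy is capped: at the modal ceiling, the vote settles within a few dozen samples, and the correlation ceiling is reached even sooner. Beyond these two ceilings, extra samples only add compute cost and can even make the model more confident in a wrong answer. The distance between rising coverage and stalled selection is the "identifiability gap": correct answers the model can produce but cannot select. The paper quantifies the cutoff as an effective number of samples and argues that the bottleneck is recognizing the correct answer, not generating more candidates.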

Original text · untranslated

People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a correct answer turns up somewhere, so coverage, the fraction of problems with at least one correct try, climbs and appears to be progress. But a deployed system must return one answer, and choosing it, not knowing which try is right, is selection; selection is capped, and past a point extra samples only make the model surer of a confident mistake, even as every draw adds cost. The gap between climbing coverage and stalled selection, the identifiability gap, is the answer a model can produce but not pick. So the real question is not whether to sample but how far, and the answer is: not far. For picking an answer, the vote has already settled within a few dozen draws, the modal ceiling; for scoring a benchmark, sooner still, the correlation ceiling. Beyond that, extra draws cost compute and add nothing, and can even make the answer worse. This paper turns the cutoff into a single number, the effective number of samples, that any sampling run already reveals. The bottleneck is recognizing a right answer, not generating one.
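The split between coverage and selection described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method: draws are treated as independent samples from a fixed answer distribution, selection is a plain plurality vote, and the toy distribution (a wrong answer "A" as the mode) is invented for illustration. The function `effective_samples_equicorrelated` is a standard Kish-style stand-in for draws with a common pairwise correlation `rho`; the abstract does not give the paper's definition of the effective number of samples, so this formula should not be read as that definition.

```python
import random
from collections import Counter


def coverage_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent draws is correct."""
    return 1.0 - (1.0 - p_correct) ** k


def majority_vote_accuracy(answer_probs, correct, k, trials=2000, seed=0):
    """Monte Carlo estimate of how often a plurality vote over k draws picks `correct`."""
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    hits = 0
    for _ in range(trials):
        counts = Counter(rng.choices(answers, weights=weights, k=k))
        top = max(counts.values())
        # Break ties uniformly at random among the most frequent answers.
        winner = rng.choice(sorted(a for a, c in counts.items() if c == top))
        hits += winner == correct
    return hits / trials


def effective_samples_equicorrelated(n: int, rho: float) -> float:
    """Kish-style effective sample size for n draws with pairwise correlation rho.

    Assumption: an illustrative stand-in, not the paper's definition.
    As n grows, the value saturates near 1 / rho.
    """
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    # Hypothetical problem: the correct answer "B" is not the most likely one.
    probs = {"A": 0.55, "B": 0.30, "C": 0.15}
    for k in (1, 5, 25, 65):
        cov = coverage_at_k(probs["B"], k)
        acc = majority_vote_accuracy(probs, "B", k)
        print(f"k={k:>2}  coverage={cov:.3f}  vote_accuracy={acc:.3f}")
    for n in (10, 100, 1000):
        print(f"n={n:>4}  n_eff(rho=0.1)={effective_samples_equicorrelated(n, 0.1):.2f}")
```

In this toy setup coverage keeps climbing with `k` while the vote converges on the wrong mode, which is the sense in which extra draws can make the system surer of a mistake. When the correct answer is the mode, the same vote settles on it instead, so the sketch shows a cap on selection, not a claim that voting always fails.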

HuggingFace Daily Papers (trending community papers)


Read the original · arxiv.org
