Evaluating Strategy or Wording? The Gap Between Surface-Level and Approach-Level Diversity in LLM Mathematical Reasoning

2026-06-29 08:00

AI Summary

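This paper introduces approach-level diversity: differences in strategy among correct solutions to the same problem. Using a human-calibrated LLM-judge framework, it finds that existing surface-level diversity metrics do not reliably reflect approach-level diversity, and that this mismatch persists in diversity-aware RLVR training, where the target metrics are preserved while approach-level diversity declines. Approach-diverse candidate sets improve test-time scaling, but directly optimizing an LLM-judge diversity reward leads the policy to cater to the judge's preferences rather than broaden its approaches. Direct optimization of approach-level diversity therefore remains an open problem. The work reveals a systematic divergence between surface-level and approach-level signals.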

Original abstract (English, untranslated)

Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
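To make the contrast in the abstract concrete, here is a minimal sketch. It is not the paper's method. It assumes distinct-n (the fraction of unique word n-grams) as a representative surface-level metric, and it uses hypothetical, hand-assigned strategy labels in place of the human-calibrated LLM judge. The pairwise-disagreement score and the names `distinct_n` and `approach_diversity` are also illustrative assumptions.

```python
from itertools import combinations


def distinct_n(solutions, n=2):
    """Surface-level metric: share of unique word n-grams across all solutions."""
    ngrams = []
    for text in solutions:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def approach_diversity(labels):
    """Assumed approach-level score: share of solution pairs whose strategy labels differ."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Three correct solutions to "sum the integers 1..100", worded differently
# but all applying the same closed-form strategy.
solutions = [
    "use the formula n times n plus one over two to get 5050",
    "applying gauss closed form we obtain one hundred times one hundred one halved",
    "by the standard arithmetic series identity the total equals 5050",
]
# Hypothetical labels; in the paper this judgment comes from an LLM judge.
labels = ["closed_form", "closed_form", "closed_form"]

surface_score = distinct_n(solutions, n=2)
approach_score = approach_diversity(labels)
print(f"distinct-2 (surface): {surface_score:.2f}")
print(f"approach-level:       {approach_score:.2f}")
```

On this toy set the surface score is high because the wording barely overlaps, while the approach-level score is zero because every solution uses the same strategy. This is the kind of divergence the abstract describes.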

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
