Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

2026-04-20 08:00

AI Summary

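The STRATAGEM framework improves the transferable reasoning of language models through trajectory-modulated game self-play. Existing methods rely only on terminal game outcomes and therefore cannot separate general reasoning from game-specific heuristics. To address this, the framework introduces a Reasoning Transferability Coefficient, which selectively reinforces trajectories that exhibit abstract, domain-agnostic reasoning, and a Reasoning Evolution Reward, which incentivizes adaptive reasoning development. Experiments show substantial improvements on mathematical reasoning, general reasoning, and code generation benchmarks, with the strongest gains on competition-level mathematics tasks.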

Original abstract

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
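The abstract names the two components but gives no formulas. As a rough illustration only, the sketch below shows one way a terminal game outcome could be modulated per trajectory. The multiplicative use of the coefficient, the additive evolution bonus, the `evolution_weight` parameter, and all names are assumptions made for this example, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One self-play episode, reduced to three hypothetical scalars."""
    outcome: float          # terminal game result, e.g. +1.0 for a win, -1.0 for a loss
    transferability: float  # assumed score in [0, 1]: how domain-agnostic the reasoning looks
    evolution: float        # assumed score in [0, 1]: how much the reasoning adapted during play


def modulated_reward(traj: Trajectory, evolution_weight: float = 0.5) -> float:
    """Scale the terminal outcome by the transferability score, then add an evolution bonus.

    This combination rule is an assumption for illustration; the abstract
    does not state how the two signals enter the training objective.
    """
    if not 0.0 <= traj.transferability <= 1.0:
        raise ValueError("transferability must lie in [0, 1]")
    if not 0.0 <= traj.evolution <= 1.0:
        raise ValueError("evolution must lie in [0, 1]")
    return traj.transferability * traj.outcome + evolution_weight * traj.evolution


# Two wins with the same terminal outcome: the one judged more abstract earns more reward.
abstract_win = Trajectory(outcome=1.0, transferability=0.9, evolution=0.4)
heuristic_win = Trajectory(outcome=1.0, transferability=0.2, evolution=0.4)
print(modulated_reward(abstract_win) > modulated_reward(heuristic_win))
```

Under these assumptions, an outcome-only reward would treat both wins identically, which is the limitation the abstract attributes to existing self-play approaches.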

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
