Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
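Original: arxiv.org
The STRATAGEM framework improves the transferable reasoning ability of language models through trajectory-modulated game self-play. Existing methods rely only on terminal game outcomes and therefore struggle to separate general reasoning from game-specific heuristics. To address this, the framework introduces a Reasoning Transferability Coefficient and a Reasoning Evolution Reward, which selectively reinforce trajectories that exhibit abstract, domain-agnostic reasoning and incentivize the development of adaptive reasoning. Experiments show significant improvements on mathematical reasoning, general reasoning, and code generation benchmarks, with especially strong results on competition-level mathematics tasks.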
Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.