RL Grokking Recipe: How Can Large Models Crack Unsolvable Tasks Through Reinforcement Learning?

2025-10-02 00:00

AI Summary

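The research team released the synthetic programming benchmark DELTA and the Manufactoria testbed. For out-of-distribution tasks where the base model's pass@128 is zero, they propose a two-phase reward schedule: dense per-test rewards first, to break the zero-gradient deadlock, then a switch to binary full-pass rewards to consolidate exact solutions. Experiments show that after a long plateau, RL training undergoes a "grokking"-style phase transition in which accuracy jumps to roughly 100%, showing that the model can discover genuinely new strategies rather than only sharpening existing knowledge. Transfer tests indicate that the learned strategies recompose programming sub-skills and extrapolate to harder parameter ranges, but remain limited on structural shifts that require new invariants.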


RL Grokking Recipe: How Can We Enable LLMs to Solve Previously Unsolvable Tasks with RL?

TL;DR

We introduce DELTA: a controlled suite of synthetic programming families with fully OOD splits and verifiable rewards. DELTA lets us ask two crisp questions: Learnability (can RL solve families where the base model has pass@K=0?) and Transferability (do the learned procedures generalize?).

On several pass@128=0 families, RL exhibits a grokking-like phase transition: after a long near-zero-reward plateau, accuracy snaps to ~100%. That is discovery, not mere sharpening.

A two-phase reward schedule is key: dense per-test rewards to escape the “all-zero” region, then binary full-pass to consolidate exact solutions. Binary-only gets stuck; dense-only hovers at “almost right.” The schedule yields the grokking jump.

Transfer is selective: RL-trained policies recompose programming sub-skills and extrapolate to harder parametric regimes, but struggle on transformative shifts that require new invariants.
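The two-phase reward schedule described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names and the `switch_step` parameter are hypothetical, and the text does not say whether the switch happens at a fixed step or on some other trigger, so a fixed step is assumed here.

```python
from typing import Sequence


def dense_reward(test_results: Sequence[bool]) -> float:
    """Per-test reward: fraction of unit tests passed (partial credit)."""
    if not test_results:
        return 0.0
    return sum(test_results) / len(test_results)


def binary_reward(test_results: Sequence[bool]) -> float:
    """Full-pass reward: 1.0 only when every test passes, else 0.0."""
    return 1.0 if test_results and all(test_results) else 0.0


def two_phase_reward(
    test_results: Sequence[bool], step: int, switch_step: int
) -> float:
    """Dense reward before `switch_step`, binary full-pass reward afterwards.

    `switch_step` is an assumed, hypothetical hyperparameter.
    """
    if step < switch_step:
        return dense_reward(test_results)
    return binary_reward(test_results)


# Hypothetical rollout in which 3 of 4 tests pass ("almost right").
almost_right = [True, True, True, False]
print(two_phase_reward(almost_right, step=10, switch_step=100))   # 0.75
print(two_phase_reward(almost_right, step=200, switch_step=100))  # 0.0
```

In phase one the "almost right" rollout still earns a non-zero signal, which is what lets training leave the all-zero region. In phase two the same rollout earns nothing, so only exact solutions are reinforced.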

Manufactoria: a pure OOD learnability testbed

Berkeley RDI: Blog (AI Safety and Evaluation)
