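Nathan Lambert commented that RL speedruns will eventually become the norm, and that the biggest current bottleneck is price: a single RL experiment is noisy because of training instability, so running multiple seeds costs on the order of $100. @jeankaddour subsequently launched the Sokoban Speedrun project: adapted from Karpathy's nanochat pipeline, it uses RL to train Qwen3-4B-Instruct to solve Sokoban puzzles, and the GRPO baseline takes only 87 minutes on 8×H100. The attempt shows the potential for low-cost, fast validation of RL methods.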
It's obvious that eventually a speedrun for RL will stick.
I currently think the biggest bottleneck is price, as an individual entry currently has too much noise from instability of RL, so running multiple seeds makes it cost O($100).
Glad to see attempts!