Nathan Lambert (@natolambert)

2026-06-19 22:25 · 13 days ago

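AI Summary

Nathan Lambert comments that an RL speedrun will eventually become the norm, and that the biggest bottleneck right now is price: a single RL experiment is too noisy because of instability, so running multiple seeds costs roughly $100. @jeankaddour introduced the Sokoban Speedrun project: built by modifying Karpathy's nanochat pipeline, it uses RL to train Qwen3-4B-Instruct to solve Sokoban puzzles, and the GRPO baseline takes only 87 minutes on 8×H100. The attempt shows the potential of fast, low-cost validation of RL methods.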

It's obvious that eventually a speedrun for RL will stick.

I currently think the biggest bottleneck is price, as an individual entry currently has too much noise from instability of RL, so running multiple seeds makes it cost O($100).

Glad to see attempts!

Jean Kaddour: With RSI around the corner, it's time for an RL speedrun. Introducing Sokoban Speedrun: training Qwen3-4B-Instruct with RL to solve Sokoban puzzles. We start by...

