AI summary
Micah Goldblum points out that vanilla SGD with batch size 1 and no momentum (the first optimizer taught in intro ML) is nearly as fast as AdamW per FLOP in LLM pretraining.
The optimal batch size is 1 (For suitable definitions of "optimal")
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretr...
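For readers who want the update rule spelled out, here is a minimal sketch of what "batch size 1, no momentum" means: each step takes the gradient from a single example and applies `w -= lr * grad`, with no running averages. The toy linear-regression problem, the function name `sgd_batch1`, and the learning rate are illustrative assumptions, not the setup from the post, and the sketch says nothing about per-FLOP speed in LLM pretraining.

```python
import random


def sgd_batch1(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with plain SGD: one example per step, no momentum.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit examples in a fresh random order
        for x, y in data:              # batch size 1: a single example per update
            err = (w * x + b) - y      # d/d(pred) of the loss 0.5 * err**2
            w -= lr * err * x          # vanilla update, no momentum buffer
            b -= lr * err
    return w, b


# Noise-free toy data drawn from y = 2x + 1 on x in [-1, 1].
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_batch1(points)
print(w, b)
```

Note that the loop keeps no optimizer state beyond the parameters themselves, whereas AdamW maintains per-parameter first- and second-moment estimates.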