Yann LeCun @ylecun

2025-07-12 05:08 · 356 days ago

AI summary

Micah Goldblum points out that vanilla SGD without momentum at batch size 1 (the first optimizer taught in intro ML) is nearly as fast as AdamW per FLOP in LLM pretraining.

The optimal batch size is 1 (For suitable definitions of "optimal")

Micah Goldblum: 🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretr...
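
For readers who want to see what "vanilla SGD without momentum at batch size 1" means mechanically, here is a minimal sketch on a toy linear regression. It is not an LLM pretraining setup and says nothing about the per-FLOP comparison with AdamW. The function name `sgd_batch_size_one`, the toy data, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import random


def sgd_batch_size_one(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b with plain SGD: one example per step, no momentum.

    Hypothetical helper for illustration only; hyperparameters are assumed.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)  # visit the examples in a fresh random order
        for x, y in examples:  # batch size 1: one example per update
            err = (w * x + b) - y
            # Gradient of the per-example loss 0.5 * err**2 w.r.t. w and b.
            # The update uses only the current gradient: no momentum buffer,
            # no per-parameter scaling as in Adam-style optimizers.
            w -= lr * err * x
            b -= lr * err
    return w, b


# Toy, noiseless data generated from y = 2x + 1 (an assumption for the demo).
toy_data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_batch_size_one(toy_data)
print(f"w = {w:.4f}, b = {b:.4f}")
```

The only state carried between steps is the parameter pair itself, which is what makes this optimizer so cheap in memory compared with AdamW, which stores two extra moment estimates per parameter.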

