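A joint paper from Stanford, MIT, Harvard, and Anthropic explains, at the level of training, why larger models are more capable: larger models forget less, and their extra capacity protects weak learning signals. Common tasks claim neurons first, and rare tasks get overwritten before they have appeared often enough. A small model may briefly pick up a rare signal, but updates from common tasks then overwrite it. The experiments use OLMo models (4M to 4B parameters), and the results show that larger models learn low-frequency tasks better, retain more task features, and suffer less gradient interference.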
Great Stanford + MIT + Harvard + Anthropic paper.
Gives a clear training-based reason for why larger models learn abilities smaller models miss.
Says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals.
The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts.
Their core idea is that common tasks take up the model's neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.
In a crowded data mixture, common patterns get first claim on the model's internal machinery.
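
A rough way to build intuition for the "extra space" point, not the paper's actual experiment: assume each task's update direction is a random unit vector in a space whose dimension stands in for model width. For random directions, the expected squared overlap between two of them is 1/dim, so wider spaces leave less room for a common task's updates to push on a rare task's direction. The function names, task count, and dimensions below are all hypothetical choices for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction in `dim` dimensions (assumed stand-in for a task's update direction)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(dim, num_tasks=20, seed=0):
    """Average squared cosine similarity over all pairs of task directions.

    A higher value means task directions overlap more, which is used here
    as a crude proxy for how much one task's update disturbs another.
    """
    rng = random.Random(seed)
    directions = [random_unit_vector(dim, rng) for _ in range(num_tasks)]
    total, pairs = 0.0, 0
    for i in range(num_tasks):
        for j in range(i + 1, num_tasks):
            dot = sum(a * b for a, b in zip(directions[i], directions[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Hypothetical "widths"; real models are far larger and far less random.
    for dim in (8, 64, 512):
        print(f"dim={dim:4d}  mean squared overlap={mean_squared_overlap(dim):.4f}")
```

This toy only covers the geometric part of the story. It does not model task frequency or the order in which tasks are learned, which is where the post's "common tasks get first claim" argument comes in.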