Rohan Paul@rohanpaul_ai

2026-06-25 18:02 · 7 days ago

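AI Summary

A joint paper from Stanford, MIT, Harvard, and Anthropic gives a training-level explanation for why larger models are more capable: larger models forget less, and their extra capacity protects weak learning signals. Common tasks claim neurons first, and rare tasks are overwritten before they appear often enough. Small models may briefly capture a rare signal, but updates from common tasks then overwrite it. The experiments use OLMo models (4M to 4B parameters) and show that larger models learn low-frequency tasks better, retain more task features, and suffer less gradient interference.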

Great Stanford + MIT + Harvard + Anthropic paper.

Gives a clear training-based reason for why larger models learn abilities smaller models miss.

Says bigger AI models learn rare skills because they forget them less during training; their extra space protects weak learning signals.

The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts.

Their core idea is that common tasks take up the model's neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.

In a crowded data mixture， common patterns get first claim on the model's internal machinery.

Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again.
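
To make the overwrite dynamic concrete, here is a minimal toy sketch. It is not the paper's model: the function `rare_skill_trace` and its parameters `interference`, `gap`, and `gain` are hypothetical names invented for illustration. It assumes a rare example adds a fixed amount of "signal" every `gap` steps, and each common-task step in between erases a fixed fraction of it.

```python
def rare_skill_trace(interference, gap, gain=1.0, steps=200):
    """Track a hypothetical rare-task signal strength over training steps.

    Every `gap` steps a rare example adds `gain` to the signal. Every other
    step is a common-task update that shrinks the signal by a factor of
    (1 - interference). All quantities are illustrative assumptions.
    """
    signal = 0.0
    trace = []
    for step in range(steps):
        if step % gap == 0:
            signal += gain                 # rare example: strengthen the signal
        else:
            signal *= 1.0 - interference   # common update: partial overwrite
        trace.append(signal)
    return trace


# Assumed settings: heavy overlap stands in for a small model,
# light overlap stands in for a large one.
small = rare_skill_trace(interference=0.20, gap=20)
large = rare_skill_trace(interference=0.01, gap=20)
print(f"small: {small[-1]:.3f}  large: {large[-1]:.3f}")
```

Under these assumptions the high-interference trace decays almost to zero before each new rare example arrives, while the low-interference trace accumulates across appearances.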

They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters.

The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less.
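
One common way to think about gradient interference is the cosine similarity between two tasks' gradient directions: a value near zero means an update for one task barely moves the other. The sketch below is only a geometric illustration, not the paper's measurement. It assumes task gradients behave like random Gaussian vectors, which real gradients do not, and the helper `mean_abs_cosine` is a hypothetical name.

```python
import math
import random


def mean_abs_cosine(dim, n_pairs=300, seed=0):
    """Average |cosine similarity| between pairs of random vectors in `dim` dimensions.

    The random vectors stand in for two tasks' gradients (an assumption made
    purely for illustration).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        total += abs(dot / (norm_a * norm_b))
    return total / n_pairs


# More dimensions leave more room for directions to be nearly orthogonal.
for dim in (8, 64, 512):
    print(dim, round(mean_abs_cosine(dim), 3))
```

In this toy setting the average overlap shrinks as the dimension grows, which is one intuition for why extra capacity could reduce interference between common-task and rare-task updates.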

Larger models can remember weak rare signals long enough to turn them into real learned skills.

----

Link - arxiv.org/abs/2605.29548

Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"
