# Why Large Models Can Learn More: Capacity, Interference, and the Rare-Task Retention Effect

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpqd5x5k03xuslnoteaq6vek
- Original link: https://arxiv.org/abs/2605.29548

## AI Summary

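This study examines why larger models can learn tasks that smaller models cannot. Experiments on synthetic data show that smaller models, with limited neuron resources, tend to allocate those neurons to high-frequency or low-complexity tasks, so they perform poorly on rare, complex tasks even when a solution capable of expressing the task exists. Larger models overcome this bottleneck through a reduced interference mechanism: they can allocate enough resources to common tasks that the associated gradient updates become weak, which lets rare-task features accumulate slowly without being overwritten. Pretraining OLMo models (4M to 4B parameters) on novel tasks confirms this conclusion: only the larger models learn the infrequent, complex tasks, and these models embed more task features in their representations and show less gradient interference between tasks.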

## Main Text

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
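To make the notion of gradient interference between tasks concrete, here is a minimal sketch in Python. It is not the paper's measurement: the abstract does not say how interference is quantified, so the cosine similarity between per-task gradients used here is an assumption, and the linear model, the toy "common" and "rare" tasks, and the helper names `mse_gradient` and `gradient_cosine` are all hypothetical.

```python
import numpy as np


def mse_gradient(weights, inputs, targets):
    """Gradient of mean squared error for a linear model y = inputs @ weights."""
    residual = inputs @ weights - targets
    return 2.0 * inputs.T @ residual / len(targets)


def gradient_cosine(grad_a, grad_b, eps=1e-12):
    """Cosine similarity of two gradients; negative values mean the updates conflict."""
    a, b = grad_a.ravel(), grad_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


rng = np.random.default_rng(0)
weights = rng.normal(size=4)  # parameters shared by both toy tasks

# Hypothetical tasks: a frequent one and a rare one that want opposite
# values for the first weight, so they compete for the same parameter.
x_common = rng.normal(size=(256, 4))
x_rare = rng.normal(size=(16, 4))
y_common = x_common @ np.array([1.0, 0.0, 0.0, 0.0])
y_rare = x_rare @ np.array([-1.0, 0.0, 0.0, 0.0])

g_common = mse_gradient(weights, x_common, y_common)
g_rare = mse_gradient(weights, x_rare, y_rare)

interference = gradient_cosine(g_common, g_rare)
print(f"cosine similarity between task gradients: {interference:.3f}")
print(f"common-task gradient norm: {np.linalg.norm(g_common):.3f}")
```

Under this assumed metric, a value near -1 means a step on the common task undoes progress on the rare task, while a value near 0 means the two updates barely interact. The gradient norm is printed as well because the abstract's mechanism concerns how weak the common-task updates become once that task is well fit, not only their direction.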
