# Joint Stanford, MIT, Harvard, and Anthropic paper: why larger models learn rare skills that smaller models cannot

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-08 12:19
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmq4q15kt00nxslt2q6vbh3kx
- Original link: https://x.com/rohanpaul_ai/status/2063838176884519159

## AI Summary

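The paper argues that larger models learn rare skills because they forget less during training: their extra capacity protects weak learning signals. The core mechanism is that common tasks claim neurons first, so rare tasks are overwritten before they appear often enough to form stable knowledge. A small model may briefly capture a rare signal, but the next wave of common-task updates overwrites it. Experiments with OLMo language models (4M to 4B parameters) support this: larger models perform better on low-frequency tasks, retain more task features, and show less gradient interference from common-task updates on rare tasks. The authors stress that the question is not only whether a small model can represent a task, but whether a rare task can persist through training under repeated pressure from many common tasks.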

## Body

Great Stanford + MIT + Harvard + Anthropic paper.

Gives a clear training-based reason for why larger models learn abilities smaller models miss.

Says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals.

The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts.

Their core idea is that common tasks take up the model's neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.

In a crowded data mixture, common patterns get first claim on the model's internal machinery.

Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again.
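The overwriting effect can be illustrated with a deliberately tiny toy, which is not the paper's setup. As an assumption for illustration, "capacity" is modeled here as the number of weights in a linear model: the small model has one weight that both tasks must share, while the large model has a separate weight per task. The task names, targets, schedule, and learning rate are all hypothetical.

```python
import numpy as np

def train(features, targets, schedule, lr=0.5, epochs=20):
    """Plain SGD on squared error for a linear model y_hat = w . x."""
    dim = len(next(iter(features.values())))
    w = np.zeros(dim)
    for _ in range(epochs):
        for task in schedule:
            x, y = features[task], targets[task]
            w -= lr * (w @ x - y) * x  # gradient of 0.5 * (w.x - y)^2
    return w

def rare_error(w, features, targets):
    """Squared error on the rare task after training."""
    return float((w @ features["rare"] - targets["rare"]) ** 2)

# Hypothetical targets: the two tasks want opposite outputs.
targets = {"common": 1.0, "rare": -1.0}
# One rare example followed by nine common ones, repeated every epoch.
schedule = ["rare"] + ["common"] * 9

# "Small" model: a single weight that both tasks must share.
small = {"common": np.array([1.0]), "rare": np.array([1.0])}
# "Large" model: enough room for each task to use its own weight.
large = {"common": np.array([1.0, 0.0]), "rare": np.array([0.0, 1.0])}

small_err = rare_error(train(small, targets, schedule), small, targets)
large_err = rare_error(train(large, targets, schedule), large, targets)
print(f"rare-task error, small: {small_err:.3f}, large: {large_err:.6f}")
```

In the shared-weight model, each rare update is undone by the nine common updates that follow, so the rare-task error stays high. In the model with a separate weight, rare updates accumulate across epochs because common updates never touch that weight.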

They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters.

The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less.
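The post does not say how the paper measures gradient interference. One common proxy, used here as an assumption, is the cosine similarity between the gradient of a common-task loss and the gradient of a rare-task loss: a negative value means a step that helps the common task hurts the rare one. The sketch below uses a linear model with mean squared error, and all data and names are hypothetical.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X w - y)^2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def gradient_cosine(g_a, g_b):
    """Cosine similarity of two gradients; 0.0 if either has zero norm."""
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    return float(g_a @ g_b / denom) if denom > 0 else 0.0

w = np.zeros(2)

# Hypothetical common task: uses feature 0 and wants a positive output.
X_common, y_common = np.array([[1.0, 0.0]]), np.array([1.0])
# Rare task sharing feature 0 but wanting the opposite output.
X_shared, y_shared = np.array([[1.0, 0.0]]), np.array([-1.0])
# Rare task that has its own feature 1.
X_own, y_own = np.array([[0.0, 1.0]]), np.array([1.0])

g_common = mse_grad(w, X_common, y_common)
conflict = gradient_cosine(g_common, mse_grad(w, X_shared, y_shared))
no_conflict = gradient_cosine(g_common, mse_grad(w, X_own, y_own))
print(f"shared feature: {conflict:+.1f}, separate feature: {no_conflict:+.1f}")
```

When the rare task shares the common task's feature and wants the opposite output, the gradients point in opposite directions. When the rare task has its own feature, the gradients are orthogonal and common-task updates leave it alone.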

Larger models can remember weak rare signals long enough to turn them into real learned skills.

----

Link - arxiv.org/abs/2605.29548

Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"