# Study reveals why large language models learn skills that small models cannot master

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-06-07 15:45
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmq3hymox00n8slqldq67sekj
- Original link: https://the-decoder.com/researchers-pinpoint-why-larger-language-models-pick-up-skills-that-small-ones-miss

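## AI Summary

Small language models perform poorly on rare tasks because frequent tasks keep overwriting what they have already learned. A new study covering models from 4 million to 4 billion parameters lays out this mechanism in detail and proposes a practical fix: rather than scaling up the model, increase how often the target task appears in the training data.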

## Full text

Researchers pinpoint why larger language models pick up skills that small ones miss

A new study suggests that instead of endlessly inflating models, it may be more efficient to increase the frequency of specific tasks in training data to anchor rare skills in smaller models.

A new study from researchers at Anthropic, Stanford, and other institutions explains why larger language models learn certain tasks that smaller ones fail at. The finding goes beyond the conventional wisdom that big models simply learn faster.

In some cases, small models can't reliably learn rare tasks even with extremely long training runs. Even well-known scaling laws show that a small model never reaches the loss of a large one, no matter how much data you throw at it.

### Common tasks crowd out rare ones

To isolate the mechanism, the researchers tested a mix of tasks with varying frequency and complexity. In their account, a model with N neurons ends up representing the N "most useful" features, where usefulness depends on how often a task appears and how important it is. Frequent, simple tasks get priority. Rare, complex ones get dropped. In the experiments, only models that were large enough learned tasks that made up just 0.25 percent of the training data.
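The capacity argument can be illustrated with a minimal sketch. It assumes one feature per task and scores usefulness as frequency times importance; both the scoring rule and the task list are hypothetical choices for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    frequency: float   # share of the training data, between 0 and 1
    importance: float  # hypothetical weight for how much the task matters


def usefulness(task: Task) -> float:
    # Assumed scoring rule: frequency times importance.
    return task.frequency * task.importance


def allocate(tasks: list[Task], n_neurons: int) -> list[str]:
    """Return the names of the n_neurons highest-scoring tasks."""
    ranked = sorted(tasks, key=usefulness, reverse=True)
    return [t.name for t in ranked[:n_neurons]]


# Made-up task mix; the rarest task covers 0.25 percent of the data.
tasks = [
    Task("common_simple", 0.60, 1.0),
    Task("common_medium", 0.30, 1.0),
    Task("uncommon", 0.0975, 1.0),
    Task("rare_task", 0.0025, 1.0),
]

small = allocate(tasks, n_neurons=2)  # rare_task is dropped
large = allocate(tasks, n_neurons=4)  # rare_task fits
```

Under these assumptions the rare task only gets a slot once the model has more neurons than there are higher-scoring tasks.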

The core of the paper is its explanation of why size helps. As long as frequent tasks aren't well-learned yet, they pull the model strongly in their direction at every training step, overwriting much of what the model picked up about rare tasks. Once a large model has mostly mastered the frequent tasks, that pull fades. The freed-up capacity goes to rare tasks, and learned signals are more likely to stick.

Small models rarely reach that point, according to the study. They fall into an "update-and-forget" loop. A rare example gets briefly learned, then largely erased by the next training steps on frequent tasks. When the next rare example shows up, the model starts over from scratch.

One experiment was designed to cleanly separate this effect. The total frequency of a rare task stays constant, but the gap between individual observations varies. The larger the gap, the more the signal decays in narrow models. Wide models hold onto it better between observations and build on it.
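A toy simulation can make the gap effect concrete. It is a sketch under strong assumptions, not the paper's experiment: the rare-task signal is a single number, each rare observation adds a fixed `gain`, and each step on frequent tasks keeps only a `retention` fraction of it. Treating wide models as having higher `retention` is an assumption made for illustration.

```python
def simulate(retention: float, gap: int, n_observations: int, gain: float = 1.0) -> float:
    """Rare-task signal right after the last rare observation.

    retention: fraction of the signal that survives one frequent-task step
               (assumed to be higher for wider models).
    gap:       number of frequent-task steps between two rare observations.
    """
    signal = 0.0
    for _ in range(n_observations):
        signal *= retention ** gap  # forgetting between rare observations
        signal += gain              # learning from the rare observation
    return signal


# Same number of rare observations in every run; only gap and retention vary.
narrow_short = simulate(retention=0.9, gap=5, n_observations=20)
narrow_long = simulate(retention=0.9, gap=50, n_observations=20)
wide_long = simulate(retention=0.999, gap=50, n_observations=20)
```

In this toy setting the narrow model with a long gap barely exceeds the signal from a single observation, while the wide model accumulates it across observations.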

### Real language models show the same pattern

To test the theory during pre-training, the team trained OLMo models ranging from 4 million to 4 billion parameters on up to 210 billion tokens from the Dolma corpus. They mixed two artificial tasks into the data, a number comparison and a modular addition, with frequencies ranging from about 1,000 instances per batch down to one instance every ten batches.
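The sketch below shows one way such synthetic tasks could be generated and scheduled. The text formats, the modulus of 97, the value range, and the function names are all hypothetical; the article does not describe how the researchers encoded the tasks or mixed them into batches.

```python
import random


def number_comparison(rng: random.Random, max_value: int = 999) -> str:
    # Hypothetical text format for a number comparison example.
    a, b = rng.randrange(max_value + 1), rng.randrange(max_value + 1)
    symbol = "<" if a < b else ">" if a > b else "="
    return f"{a} ? {b} -> {symbol}"


def modular_addition(rng: random.Random, modulus: int = 97) -> str:
    # Hypothetical text format; the modulus is an assumption.
    a, b = rng.randrange(modulus), rng.randrange(modulus)
    return f"({a} + {b}) mod {modulus} -> {(a + b) % modulus}"


def instances_for_batch(batch_index: int, rate: float) -> int:
    """Number of synthetic instances to inject into one batch.

    rate is the average number of instances per batch: 1000 means
    1,000 per batch, 0.1 means one instance every ten batches.
    """
    if rate >= 1:
        return round(rate)
    period = round(1 / rate)
    return 1 if batch_index % period == 0 else 0


rng = random.Random(0)
examples = [modular_addition(rng) for _ in range(instances_for_batch(0, rate=0.1))]
```

The two ends of the range described above, 1,000 per batch and 0.1 per batch, differ by a factor of 10,000.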

Only the larger OLMo models picked up the rare tasks by learning the rule behind them and applying it to new cases, rather than just memorizing individual examples.

This was especially clear with modular addition, where the researchers observed what's known as grokking. A model memorizes a task first, then suddenly clicks on the actual principle after more training. Only the bigger models hit that moment, and only when the task showed up often enough in the data.

A look inside the models tells the same story. In the one-billion-parameter model, every training step that included the rare task pushed clearly toward the right answer. In the 20-million-parameter model, that signal drowned in noise from everything else. Almost no real learning took place.

### Memorization turns out to be a stepping stone

The study treats memorization as a prerequisite for generalization, rather than an unwanted side effect. A model needs to hold onto individual observations long enough for a broader pattern to take shape across many batches.

This offers a practical alternative to just making models bigger. Instead of scaling up the model, practitioners can increase the frequency of a target task in the training data to anchor a specific skill, the research suggests.
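As a rough illustration of what raising a task's frequency means in practice, the sketch below computes how many times a set of target examples would need to appear to reach a chosen share of the data. The function name and the numbers are hypothetical, and repeating examples is only one possible way to raise frequency; the article does not say which method the study recommends.

```python
import math


def repeats_for_share(n_other: int, n_target: int, target_share: float) -> int:
    """Smallest repeat count k with k * n_target / (n_other + k * n_target) >= target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    if n_target <= 0 or n_other < 0:
        raise ValueError("n_target must be positive and n_other non-negative")
    # Solving the inequality for k gives k >= share * n_other / ((1 - share) * n_target).
    k = math.ceil(target_share * n_other / ((1 - target_share) * n_target))
    return max(1, k)


# Made-up numbers: 500 target examples among 1,000,000 others, aiming for 1 percent.
k = repeats_for_share(n_other=1_000_000, n_target=500, target_share=0.01)
share = k * 500 / (1_000_000 + k * 500)
```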

There's more than one theory for why model size helps. In May, an MIT team tied scaling laws to model geometry, where models store more concepts through superposition than their dimensions should allow.

This new study starts from a different angle, focusing on what a model can actually learn from a given data mix during training. The older debate about whether abilities truly "emerge" in sudden jumps past a certain size, or whether that's partly a measurement artifact, is still playing out.

