MIT study explains why scaling language models improves performance so reliably

2026-05-03 16:42 · Maximilian Schreiner

AI summary

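MIT researchers offer a mechanistic explanation, based on a phenomenon called superposition, for why language model performance improves so reliably with scale. The study shows that language models encode far more concepts than they have dimensions by letting the concept vectors overlap, and that in this regime prediction error falls predictably as a power law, roughly as 1/m for a model of width m. The finding gives a mathematical account of why scaling up models such as GPT and Claude keeps improving their ability to understand and generate text.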

Original article

MIT study explains why scaling language models works so reliably

MIT researchers have a mechanistic explanation for why large language model performance scales so reliably with size. The answer comes down to a phenomenon called superposition.

The observation that bigger models perform better is one of the most consistent findings in AI research. Double the parameters, training data, or compute, and a language model's prediction error drops following a power law. These so-called "Neural Scaling Laws" drive the push to build ever-larger systems. But why they exist in the first place has never been fully explained.

A study presented at NeurIPS 2025 by Yizhou Liu, Ziming Liu, and Jeff Gore from MIT traces the phenomenon back to a geometric property built into the models themselves: superposition.

Language models pack more concepts than they have room for

Language models need to fit tens of thousands of tokens and even more abstract meanings into an internal space that only has a few thousand dimensions. In principle, a three-dimensional space has room for only three mutually orthogonal directions, so it can hold only three concepts without interference. LLMs get around this limitation by storing many concepts simultaneously in the same dimensions. The resulting vectors overlap slightly. This squeezing of multiple meanings into too little space is what researchers call superposition.
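The packing idea can be illustrated with a few lines of Python. This is a minimal sketch, not the paper's setup: it assumes concept vectors behave like random unit vectors, and the names `random_unit_vector` and `overlap_stats`, as well as the sizes (200 concepts, 50 dimensions), are hypothetical choices made for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def overlap_stats(num_concepts, dim, seed=0):
    """Return (mean, max) absolute cosine overlap across all pairs of concept vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_concepts)]
    overlaps = []
    for i in range(num_concepts):
        for j in range(i + 1, num_concepts):
            # Both vectors have unit length, so the dot product is the cosine.
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            overlaps.append(abs(dot))
    return sum(overlaps) / len(overlaps), max(overlaps)

# Assumption for illustration: 200 concepts squeezed into 50 dimensions,
# i.e. four times more concepts than orthogonal axes.
mean_overlap, max_overlap = overlap_stats(num_concepts=200, dim=50)
print(f"mean |cos| = {mean_overlap:.3f}, max |cos| = {max_overlap:.3f}")
```

With more concepts than dimensions the vectors cannot all be orthogonal, so every pair carries some overlap, but the typical overlap stays well below 1. Raising `dim` while keeping `num_concepts` fixed should shrink it further.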

Until now, many explanations assumed that only the most common concepts get cleanly represented while the rest are lost ("weak superposition"). The MIT team shows, using a simplified model from Anthropic, that this picture doesn't match how real LLMs actually work.

Two regimes offer two different explanations

The researchers built a heavily simplified AI model with a training dial that let them control how much stored concepts were allowed to overlap. This made it possible to compare two extreme cases.

In the first case, weak superposition, the model only stores the most common concepts cleanly and ignores the rest. Prediction error here comes mainly from the rare concepts that get dropped. Whether performance scales cleanly as a power law depends on how concepts are distributed in the training data. Only when that distribution itself follows a power law does the error follow one too. The paper calls this "power law in, power law out."
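A small numerical sketch can make "power law in, power law out" concrete. It rests on assumptions that are not taken from the article: concept frequencies follow a Zipf-like law p_i ∝ i^(-alpha), a model of width m stores exactly the m most frequent concepts, and the error is proportional to the frequency mass of everything it drops. The helper names `dropped_mass` and `local_exponent` and the cutoff of 100,000 concepts are hypothetical.

```python
import math

def dropped_mass(kept, alpha, num_features=100_000):
    """Fraction of total frequency mass carried by concepts ranked below `kept`."""
    # Assumed Zipf-like frequencies: weight of the concept at rank r is r^(-alpha).
    weights = [rank ** -alpha for rank in range(1, num_features + 1)]
    return sum(weights[kept:]) / sum(weights)

def local_exponent(kept, alpha):
    """Log-log slope of the dropped mass between `kept` and `2 * kept` stored concepts."""
    return math.log2(dropped_mass(2 * kept, alpha) / dropped_mass(kept, alpha))

for alpha in (1.5, 2.0):
    print(f"alpha = {alpha}: slope ~ {local_exponent(100, alpha):.2f}")
```

Under these assumptions the slope lands close to -(alpha - 1), so the exponent of the error curve is inherited from the exponent of the data distribution. That dependence on the data is what separates this regime from the strong-superposition case.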

In the second case, strong superposition, the model stores all concepts at once by letting their vectors overlap slightly. The error no longer comes from missing concepts but from the noise created by these overlaps. Here, a robust pattern emerges: doubling the model's width roughly cuts the error in half, as predicted by a simple geometric relationship in which the error scales as 1/m, where m is the model's width. How concepts are distributed in the data barely matters anymore.
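The 1/m relationship can be motivated with a short calculation. This is a sketch under a stated assumption, namely that concept vectors behave like independent random unit vectors $u, v \in \mathbb{R}^m$; that is an illustrative stand-in and not necessarily the arrangement a trained model learns. The symbols $L(m)$ for the error and $C$ for a width-independent constant are notation introduced here.

```latex
% Assumption: u is uniform on the unit sphere in R^m, independent of v, and ||v|| = 1.
\begin{align*}
\sum_{i=1}^{m} u_i^2 = 1
  \;&\Rightarrow\; \mathbb{E}[u_i u_j] = \frac{\delta_{ij}}{m}
  && \text{(by symmetry)} \\
\mathbb{E}\big[(u \cdot v)^2\big]
  &= \sum_{i,j} v_i v_j \, \mathbb{E}[u_i u_j]
   = \frac{\lVert v \rVert^2}{m} = \frac{1}{m} \\
L(m) \approx \frac{C}{m}
  \;&\Rightarrow\; \frac{L(2m)}{L(m)} = \frac{1}{2}
\end{align*}
```

If the error is driven by squared overlaps of this kind, it shrinks in proportion to 1/m whatever the concept frequencies are, which is consistent with the halving of the error when the width doubles.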

The Decoder: AI News (RSS)
