MIT study explains why scaling language models works so reliably
MIT researchers have proposed a mechanistic explanation for why the performance of large language models scales so reliably with size. The answer comes down to a phenomenon called superposition.
The observation that bigger models perform better is one of the most consistent findings in AI research. As parameters, training data, or compute grow, a language model's prediction error falls along a power law, so each doubling shrinks the error by roughly the same factor. These so-called "Neural Scaling Laws" drive the push to build ever-larger systems. But why they exist in the first place has never been fully explained.
A study presented at NeurIPS 2025 by Yizhou Liu, Ziming Liu, and Jeff Gore from MIT traces the phenomenon back to a geometric property built into the models themselves: superposition.
Language models pack more concepts than they have room for
Language models need to fit tens of thousands of tokens, and even more abstract meanings, into an internal space that has only a few thousand dimensions. A three-dimensional space has only three mutually perpendicular directions, so in theory it can hold only three concepts without any interference between them. LLMs get around this limitation by storing many concepts in the same dimensions at once, which means the resulting vectors overlap slightly. This squeezing of more meanings into a space than it has dimensions is what researchers call superposition.
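To make the idea of "slightly overlapping" vectors concrete, here is a minimal sketch. It assumes concept vectors are random directions, which is an illustrative stand-in and not how a trained model arranges them; the function names and the sizes (200 concepts, 3 versus 100 dimensions) are hypothetical choices for the example.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_overlap(num_concepts, dim, seed=0):
    """Average |cosine similarity| over all distinct pairs of concept vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(num_concepts)]
    total, pairs = 0.0, 0
    for i in range(num_concepts):
        for j in range(i + 1, num_concepts):
            total += abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
            pairs += 1
    return total / pairs

# In both cases there are more concepts than dimensions, so some overlap is unavoidable.
overlap_low_dim = mean_abs_overlap(num_concepts=200, dim=3)
overlap_high_dim = mean_abs_overlap(num_concepts=200, dim=100)

print(f"200 concepts in   3 dims: mean |overlap| = {overlap_low_dim:.3f}")
print(f"200 concepts in 100 dims: mean |overlap| = {overlap_high_dim:.3f}")
```

Under these assumptions, the average overlap is large in three dimensions and much smaller in a hundred, even though the hundred-dimensional space still holds twice as many concepts as it has dimensions.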
Until now, many explanations assumed that only the most common concepts get cleanly represented while the rest is lost ("weak superposition"). The MIT team shows, using a simplified model from Anthropic, that this picture doesn't match how real LLMs actually work.
Two regimes offer two different explanations
The researchers built a heavily simplified AI model with a training dial that let them control how much the stored concepts were allowed to overlap. This made it possible to compare two extreme cases.
In the first case, weak superposition, the model stores only the most common concepts cleanly and ignores the rest. Prediction error here comes mainly from the rare concepts that get dropped. Whether the error falls as a clean power law depends on how often each concept appears in the training data: only when those frequencies themselves follow a power law does the error follow one too. The paper calls this "power law in, power law out."
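A rough numerical sketch of "power law in, power law out" follows. It rests on assumptions that go beyond the article: a model of width m keeps exactly the m most frequent concepts, and its error is proportional to the total frequency of everything it drops. The exponent alpha = 2, the 0.99 decay rate, and the helper name dropped_mass are hypothetical choices for illustration.

```python
def dropped_mass(weights, width):
    """Share of total concept frequency lost when only the top `width` concepts are kept."""
    return sum(weights[width:]) / sum(weights)

num_concepts = 20_000
alpha = 2.0

# Concept frequencies sorted from most to least common.
power_law = [rank ** -alpha for rank in range(1, num_concepts + 1)]
exponential = [0.99 ** rank for rank in range(1, num_concepts + 1)]

widths = (50, 100, 200)

# How much the dropped share shrinks each time the width doubles.
power_law_ratios = [
    dropped_mass(power_law, 2 * m) / dropped_mass(power_law, m) for m in widths
]
exponential_ratios = [
    dropped_mass(exponential, 2 * m) / dropped_mass(exponential, m) for m in widths
]

for m, r_pow, r_exp in zip(widths, power_law_ratios, exponential_ratios):
    print(f"width {m:3d} -> {2 * m:3d}: power-law input x{r_pow:.3f}, exponential input x{r_exp:.3f}")
```

With power-law frequencies, every doubling shrinks the dropped share by nearly the same factor (close to 2^-(alpha - 1), which is 0.5 here), and that constant factor is the signature of a power law. With exponentially decaying frequencies the factor changes from one doubling to the next, so the error does not trace out a power law.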
In the second case, strong superposition, the model stores all concepts at once by letting their vectors overlap slightly. The error no longer comes from missing concepts but from the noise these overlaps create. Here, a robust pattern emerges: doubling the model's width roughly cuts the error in half, as predicted by a simple geometric relationship in which the error is proportional to 1/m, where m is the model's width. How concepts are distributed in the data barely matters anymore.
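The 1/m relationship can be checked on a toy case. The sketch below assumes concept vectors are random unit directions and treats the mean squared overlap between pairs as a proxy for interference noise. Both are simplifying assumptions, not the paper's actual setup or loss, and the names and sizes are hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_squared_overlap(num_concepts, width, seed=0):
    """Average squared dot product over all distinct pairs of concept vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(width, rng) for _ in range(num_concepts)]
    total, pairs = 0.0, 0
    for i in range(num_concepts):
        for j in range(i + 1, num_concepts):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            total += dot * dot
            pairs += 1
    return total / pairs

# 150 concepts always exceed the width, so every case is in superposition.
widths = (32, 64, 128)
overlaps = {m: mean_squared_overlap(num_concepts=150, width=m) for m in widths}

for m in widths:
    print(f"width {m:3d}: mean squared overlap {overlaps[m]:.5f}, times width {m * overlaps[m]:.3f}")
```

If the overlap noise really scales as 1/m, the product of width and mean squared overlap should stay near 1 as the width doubles, which is what random directions give. Nothing in this calculation depends on how frequent each concept is, which matches the article's point that the data distribution barely matters in this regime.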