Large Language Models as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

2026-05-22 08:00 · 42 days ago

AI Summary

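Existing scaling laws for large language models are mostly monotonic power laws, so they cannot explain non-monotonic phenomena such as catastrophic overtraining or quantization-induced degradation. The study proposes the Shannon Scaling Law, which models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem: model parameters map to channel bandwidth and training tokens map to signal power. The framework reveals a fundamental capacity limit for LLMs: if a sufficient signal-to-noise ratio cannot be maintained, blindly scaling up amplifies noise, and performance shifts from monotonic improvement to U-shaped degradation. Experiments on Pythia and OLMo2 models show that the law accurately captures the loss basins and that it extrapolates: fitted on models with at most 6.9B parameters and at most 180B training tokens, it predicts the unseen 12B model up to 307B tokens with a pooled R^2 of 0.847.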

Original text · untranslated

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R^2 scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on ≤6.9B Pythia models with ≤180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled R^2 = 0.847, while monotonic baselines collapse.
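
For readers unfamiliar with the theorem the abstract builds on, the following is a minimal sketch of the classical Shannon-Hartley capacity, C = B log2(1 + S / (N0 B)), for an additive white Gaussian noise channel. It is not the paper's Shannon Scaling Law, whose functional form the abstract does not give. The function names and the numeric values of `S` and `N0` are illustrative assumptions. The sketch only shows a well-known property of the classical formula: with signal power held fixed, widening the bandwidth lowers the SNR and capacity saturates at S / (N0 ln 2) instead of growing without bound.

```python
import math


def shannon_hartley_capacity(bandwidth, signal_power, noise_density):
    """Classical Shannon-Hartley capacity (bits per second) of an AWGN channel.

    C = B * log2(1 + S / (N0 * B)), where N0 * B is the total in-band noise power.
    """
    noise_power = noise_density * bandwidth
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def wideband_limit(signal_power, noise_density):
    """Limit of the capacity as bandwidth grows without bound: S / (N0 * ln 2)."""
    return signal_power / (noise_density * math.log(2))


# Illustrative values only (assumptions, not taken from the paper).
S = 1.0    # fixed signal power
N0 = 1e-3  # noise power spectral density

if __name__ == "__main__":
    for B in (1e1, 1e2, 1e3, 1e4, 1e5):
        snr = S / (N0 * B)
        capacity = shannon_hartley_capacity(B, S, N0)
        print(f"B={B:>9.0f}  SNR={snr:>8.3f}  C={capacity:>9.2f} bits/s")
    print(f"wideband limit: {wideband_limit(S, N0):.2f} bits/s")
```

In the paper's analogy, bandwidth stands in for model parameters and signal power for training tokens. The classical formula alone only saturates; the U-shaped degradation described in the abstract comes from the paper's own formulation, which this sketch does not attempt to reproduce.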

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
