# Tapered Language Models (TLM)

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-22 08:00
- AIHOT score: 52
- AIHOT link: https://aihot.virxact.com/items/cmqq6uzx50759slp5h64724bp
- Original link: https://arxiv.org/abs/2606.23670

## AI Summary

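Modern language models (Transformer, recurrent, and memory-based variants) allocate parameters uniformly across depth by default. Experiments under a fixed budget show that giving earlier layers more capacity and later layers less improves perplexity, while the reverse allocation hurts. Building on this, the paper proposes Tapered Language Models (TLMs) as an architectural principle: MLP width is tapered monotonically across depth via a smooth cosine schedule. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans), TLMs consistently outperform uniform-width baselines, improving perplexity and downstream benchmark performance at no additional parameter or compute cost.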

## Full Text

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
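The abstract does not spell out the exact schedule, so the following is only a minimal sketch of what a cosine taper under a fixed budget might look like. The function name `cosine_taper_widths`, the `ratio` parameter (first-layer width divided by last-layer width), and the rounding rule are all assumptions made for illustration, not the paper's method. The sketch relies on the fact that, at a fixed model dimension, MLP parameter count grows linearly with hidden width, so keeping the sum of per-layer widths equal to the uniform baseline keeps the MLP budget fixed.

```python
import math


def cosine_taper_widths(n_layers, base_width, ratio=2.0):
    """Hypothetical helper: per-layer MLP widths that shrink with depth.

    The widths follow a half-cosine curve from the first layer to the last
    and sum to n_layers * base_width, the budget of a uniform baseline.
    `ratio` (assumed, not from the paper) is first width / last width.
    """
    if n_layers < 1 or base_width < 1:
        raise ValueError("n_layers and base_width must be positive")
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1 for a decreasing taper")
    if n_layers == 1:
        return [base_width]

    # Unnormalised weights: `ratio` at depth 0, decaying smoothly to 1.0.
    raw = []
    for i in range(n_layers):
        t = i / (n_layers - 1)  # relative depth in [0, 1]
        raw.append(1.0 + (ratio - 1.0) * 0.5 * (1.0 + math.cos(math.pi * t)))

    # Rescale so the widths sum to the uniform budget, then round down.
    budget = n_layers * base_width
    scale = budget / sum(raw)
    widths = [int(w * scale) for w in raw]

    # Hand the units lost to rounding back to the earliest layers.
    # This keeps the sequence non-increasing and restores the budget.
    for i in range(budget - sum(widths)):
        widths[i % n_layers] += 1
    return widths


if __name__ == "__main__":
    for depth, width in enumerate(cosine_taper_widths(12, 4096)):
        print(f"layer {depth:2d}: MLP width {width}")
```

Setting `ratio=1.0` recovers the uniform baseline, which makes it easy to compare a tapered and a uniform configuration at the same total width. In practice one would probably also round widths to a hardware-friendly multiple, which this sketch omits.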