# Variable-Width Transformers

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmqhgiko2046xsle1tddo0mio
- Original link: https://arxiv.org/abs/2606.18246

## AI Summary

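The paper proposes Variable-Width Transformers with a "times-shaped" bottleneck structure that allocates capacity non-uniformly across the depth of a language model. The architecture outperforms parameter-matched uniform baselines on language modeling loss. Its lower average layer width cuts total FLOPs by 22% and KV cache memory and I/O cost by 15%. Analysis of the residual stream shows that the bottleneck structure yields qualitatively different representations. The experiments indicate that non-uniform width allocation enables more resource-optimal scaling of language models.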

## Main Text

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures keep a constant width across all layers, spreading a fixed parameter and computation budget evenly even though different layers may play distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a times-shaped (><) bottleneck Variable-Width Transformer, which consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and less KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
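
To make the idea of a non-uniform width schedule concrete, here is a minimal sketch. It assumes that "times-shaped" means layers are widest at both ends of the network and narrowest in the middle, and that per-layer KV cache size scales linearly with layer width. The linear schedule, the function names `times_shaped_widths` and `relative_kv_cache`, and the example sizes are all hypothetical illustrations, not the schedule or numbers used in the paper.

```python
def times_shaped_widths(num_layers, max_width, min_width):
    """Hypothetical schedule: widths fall linearly from max_width at the
    first layer to min_width at the middle, then rise back symmetrically."""
    if num_layers < 2:
        raise ValueError("need at least two layers")
    mid = (num_layers - 1) / 2  # position of the bottleneck
    widths = []
    for i in range(num_layers):
        frac = abs(i - mid) / mid  # 1.0 at both ends, 0.0 at the center
        widths.append(round(min_width + frac * (max_width - min_width)))
    return widths


def relative_kv_cache(widths, uniform_width):
    """KV cache size relative to a uniform-width model with the same depth,
    assuming (for illustration) cache size is proportional to layer width."""
    return sum(widths) / (uniform_width * len(widths))


widths = times_shaped_widths(num_layers=5, max_width=1024, min_width=512)
print(widths)
print(relative_kv_cache(widths, uniform_width=1024))
```

This toy comparison uses a uniform model at the maximum width, so it is not parameter-matched and is not meant to reproduce the 22% or 15% figures above. It only shows how a lower average layer width translates into a smaller width-proportional cost.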