CHIAR-Former：明暗注意力--在黑暗中分配计算

2026-06-06 08:00·27天前

AI 摘要

CHIAR-Former 是一种 4 层混合 Transformer，根据每个 token 的谱熵将其路由至 DCT 谱混合或全自注意力（RBF 核混合在消融中被拒绝）。仅含 DCT+注意力的变体在 WikiText-103 上获得 Val PPL 36.54，相比全注意力基线（PPL 66.62）提升 45%，同时减少 62.5% 注意力 FLOPs。在 WikiText-2、IMDB 情感分类和 ListOps 上的评估表明，模型在大规模自然文本中因 token 多样性受益，而全注意力在小数据集和合成任务中仍占优势。

原文 · 未翻译

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.

HuggingFace Daily Papers（社区热门论文）

59导出 Markdown

CHIAR-Former：明暗注意力--在黑暗中分配计算

2026-06-06 08:00·27天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译