# MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-07 08:00
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmqaef40q0kqoslldyv899otv
- Original link: https://arxiv.org/abs/2606.08788

## AI Summary

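To address the mismatch in timestep information between noisy inputs and clean reference features in diffusion model training, the paper takes a token-level view and finds that, under full-token alignment, tokens with large gradient norms show a stable spatial preference, which leads the model to rely excessively on the complete set of clean-image tokens. It therefore proposes MaskAlign, which applies representation alignment to randomly sampled token subsets during training, reducing dependence on the complete token set and improving robustness. It also introduces a lightweight pre-mask token mixing block that shares information across tokens before masking, to mitigate information loss. Experiments show that the method effectively improves the training efficiency and generation quality of diffusion Transformers.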

## Full Text

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
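
The token-subset idea described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the function names (`token_mix`, `sample_token_subset`, `mask_align_loss`), the `keep_ratio` and `mix_weight` parameters, the mean-blending mixer standing in for the pre-mask token mixing block, and the negative cosine similarity used as the alignment loss are all hypothetical choices made for illustration. The abstract does not specify any of them.

```python
import numpy as np


def token_mix(h, mix_weight):
    """Hypothetical stand-in for a pre-mask token mixing block.

    Blends every token with the mean over all tokens so that some information
    from tokens that are later dropped survives in the kept ones.
    h has shape (batch, num_tokens, dim).
    """
    mean_token = h.mean(axis=1, keepdims=True)
    return (1.0 - mix_weight) * h + mix_weight * mean_token


def sample_token_subset(num_tokens, keep_ratio, rng):
    """Draw a random subset of token indices without replacement."""
    num_keep = max(1, int(round(num_tokens * keep_ratio)))
    return np.sort(rng.choice(num_tokens, size=num_keep, replace=False))


def cosine_alignment_loss(h, z, eps=1e-8):
    """Negative mean cosine similarity between matching tokens (assumed loss)."""
    h_unit = h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)
    z_unit = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    return float(-(h_unit * z_unit).sum(axis=-1).mean())


def mask_align_loss(h, z, keep_ratio, rng, mix_weight=0.1):
    """Alignment loss computed on a freshly sampled token subset.

    h: diffusion features, assumed already projected to the encoder dimension,
       shape (batch, num_tokens, dim).
    z: clean-image encoder features with the same shape.
    Returns the loss and the indices of the tokens that were aligned.
    """
    h = token_mix(h, mix_weight)                 # share information before masking
    idx = sample_token_subset(h.shape[1], keep_ratio, rng)
    return cosine_alignment_loss(h[:, idx], z[:, idx]), idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(2, 16, 8))   # toy diffusion features
    z = rng.normal(size=(2, 16, 8))   # toy clean-image reference features
    loss, idx = mask_align_loss(h, z, keep_ratio=0.5, rng=rng)
    print("aligned tokens:", idx.tolist(), "loss:", loss)
```

Calling `mask_align_loss` once per training iteration with the same `rng` yields a different subset each time, which is the mechanism the abstract credits for reducing dependence on the complete token set. In a real training loop this term would be added to the diffusion objective with some weighting, which is likewise not specified here.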