Rohan Paul@rohanpaul_ai

2026-05-28 19:03 · 35 days ago

AI Summary

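Traditional Diffusion Transformers train inefficiently because the way information passes between layers is fixed. The research team proposes Diffusion-Adaptive Routing, which lets each layer dynamically choose which earlier layers' outputs to use, with that choice adjusted by denoising timestep. The method introduces no new dataset, loss function, or attention mechanism; by changing only the residual connections, it cuts the training iterations needed to reach the same image quality by 8.75x.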

Image diffusion Transformers train poorly because their layers pass information in a fixed, outdated way.

Now they can train much faster by changing how layers share information.

In this paper, the same image quality was reached with 8.75x fewer training iterations.

The surprise is not that Diffusion Transformers had an inefficiency, but where it was hiding.

Researchers have spent years refining attention, conditioning, tokenization, objectives, and autoencoders, while leaving the residual stream mostly untouched because it looked like plumbing rather than intelligence.

In a standard residual stack, every layer keeps adding its output to the running stream, which sounds harmless until the stream's magnitude swells, gradients fade backward, and neighboring blocks begin saying nearly the same thing.
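To make the magnitude problem concrete, here is a rough toy sketch in plain Python. It is not the paper's experiment: `toy_block` is a hypothetical stand-in that emits a random unit-norm update, and the depth and width are arbitrary assumptions. It only shows that repeatedly adding updates to a running stream makes the stream's norm grow, so each later block's contribution becomes relatively smaller.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def toy_block(x, rng):
    # Hypothetical stand-in for a Transformer block (an assumption, not the
    # paper's model): returns a random update of unit norm, regardless of
    # how large the incoming stream is.
    u = [rng.gauss(0.0, 1.0) for _ in x]
    n = norm(u)
    return [ui / n for ui in u]


def residual_stack_norms(depth=28, dim=64, seed=0):
    """Track the stream norm through a plain residual stack."""
    rng = random.Random(seed)
    x = toy_block([0.0] * dim, rng)  # unit-norm starting stream
    norms = [norm(x)]
    for _ in range(depth):
        update = toy_block(x, rng)
        # Standard residual connection: x_{l+1} = x_l + f_l(x_l)
        x = [xi + ui for xi, ui in zip(x, update)]
        norms.append(norm(x))
    return norms


norms = residual_stack_norms()
print(f"stream norm at input: {norms[0]:.2f}")
print(f"stream norm after {len(norms) - 1} blocks: {norms[-1]:.2f}")
```

In this toy the growth is roughly like the square root of the depth, because the updates are independent random directions; a real network can behave differently.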

That is bad for any Transformer, but it is especially awkward for diffusion, because denoising is not one fixed task repeated at every step.

The authors found three signs that this old setup hurts the model: signals get too large going forward, learning signals fade going backward, and nearby blocks often produce almost the same features.

Their fix is Diffusion-Adaptive Routing, a replacement that lets each layer choose which earlier layer outputs to use, and the choice changes with the denoising timestep.
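One way to picture the idea is the sketch below. It is an illustration under stated assumptions, not the paper's actual formulation: the softmax mixing, the linear-in-`t` logits, and the names `routing_weights`, `route`, `base_logits`, and `time_slopes` are all hypothetical. It only shows a layer input built as a weighted mix of earlier layer outputs, where the weights shift as the timestep changes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(base_logits, time_slopes, t):
    # Hypothetical parameterization (an assumption for illustration):
    # logit_j(t) = base_j + slope_j * t, with t scaled to [0, 1].
    return softmax([b + s * t for b, s in zip(base_logits, time_slopes)])


def route(earlier_outputs, weights):
    # Weighted mix of earlier layer outputs, used in place of the plain
    # running sum of a standard residual stream.
    dim = len(earlier_outputs[0])
    return [
        sum(w * h[i] for w, h in zip(weights, earlier_outputs))
        for i in range(dim)
    ]


# Three toy "earlier layer outputs" of width 2.
earlier = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_logits = [0.0, 0.0, 0.0]
time_slopes = [2.0, 0.0, -2.0]

w_a = routing_weights(base_logits, time_slopes, t=0.0)  # uniform mix
w_b = routing_weights(base_logits, time_slopes, t=1.0)  # favors output 0
mixed = route(earlier, w_b)
print(w_a, w_b, mixed)
```

In a trained model the logits would be learned per layer; here they are fixed by hand so the timestep dependence is easy to see.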

The big deal is that the paper does not add a new image dataset, loss, tokenizer, or attention trick, but instead questions the old residual connection that most models kept copying from language Transformers.

Link - arxiv.org/abs/2605.20708

Title: "Rethinking Cross-Layer Information Routing in Diffusion Transformers"