传统Diffusion Transformers因层间信息传递方式固化导致训练效率低下。研究团队提出Diffusion-Adaptive Routing方法,允许每层动态选择使用哪些早期层的输出,且该选择随去噪时间步调整。该方法未引入新的数据集、损失函数或注意力机制,仅通过优化残差连接,使得相同图像质量所需的训练迭代次数减少8.75倍。
Image diffusion Transformers train poorly because their layers pass information in a fixed, outdated way.
Now they can train much faster by changing how layers share information.
With this paper, the same image quality arrived with 8.75x fewer training iterations.
The surprise is not that Diffusion Transformers had an inefficiency, but where it was hiding.
Researchers have spent years refining attention, conditioning, tokenization, objectives, and autoencoders, while leaving the residual stream mostly untouched because it looked like plumbing rather than intelligence.
In a standard residual stack, every layer keeps adding its output to the running stream, which sounds harmless until the stream's magnitude swells, gradients fade backward, and neighboring blocks begin saying nearly the same thing.