SEGA: A Resolution Extrapolation Method for Diffusion Transformers Based on Spectral-Energy-Guided Attention
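Source: arxiv.org
The research team proposes SEGA, a training-free method that addresses the performance drop of diffusion Transformers when they generate images beyond their training resolution. Based on the spatial-frequency structure of the latent during denoising, the method applies dynamic, adaptive attention scaling to the different frequency components of Rotary Position Embeddings, improving global structural coherence while also recovering fine detail more faithfully. Experiments show that SEGA consistently improves high-resolution image synthesis quality across multiple target resolutions and outperforms current state-of-the-art training-free baselines.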
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
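To make the idea of content-dependent, per-component scaling more concrete, the following is a minimal sketch and not the paper's actual rule. It assumes a single-channel 2D latent, standard RoPE frequencies with base 10000, and a hypothetical mapping in which each RoPE component's scale grows with the share of the latent's spectral energy in the matching radial frequency band. The function names (`rope_frequencies`, `band_energy`, `component_scales`) and the `max_scale` parameter are illustrative assumptions, not identifiers from SEGA.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Standard RoPE: one rotation frequency per channel pair, ordered high -> low.
    return base ** (-np.arange(0, dim, 2) / dim)

def band_energy(latent, n_bands):
    # Fraction of the latent's spectral energy in n_bands radial bands (low -> high).
    spec = np.abs(np.fft.fft2(latent)) ** 2
    h, w = latent.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-12, n_bands + 1)
    idx = np.clip(np.digitize(radius, edges) - 1, 0, n_bands - 1)
    energy = np.bincount(idx.ravel(), weights=spec.ravel(), minlength=n_bands)
    return energy / max(energy.sum(), 1e-12)

def component_scales(latent, dim, max_scale=1.2):
    # Hypothetical rule (an assumption for illustration): a RoPE component is
    # scaled more strongly when its matching frequency band carries more energy.
    n = rope_frequencies(dim).size
    energy = band_energy(latent, n)
    # Bands run low -> high, RoPE components run high -> low, so reverse to align.
    weights = energy[::-1] / max(energy.max(), 1e-12)
    return 1.0 + (max_scale - 1.0) * weights

# A perfectly smooth latent has all its energy in the lowest band, so only
# the lowest-frequency RoPE component (the last one) is scaled up.
scales = component_scales(np.ones((16, 16)), dim=8)
```

A uniform, content-agnostic baseline would correspond to returning the same factor for every component regardless of the latent. In an attention layer, factors like these could, for instance, weight each rotated channel pair's contribution to the query-key logits; how SEGA actually derives and applies its scales is defined in the paper, not by this sketch.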