# Spectral Forcing: Improving Pixel-Space Diffusion Model Efficiency with an Input-Side Spectral Prior

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 38
- AIHOT link: https://aihot.virxact.com/items/cmqhiqg2b04s0sle1fxh8a4br
- Original link: https://arxiv.org/abs/2606.15236

## AI Summary

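Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal is strongly frequency dependent. This paper proposes Spectral Forcing, which applies a time-conditional 2D-DCT low-pass operator to the noisy input before the patch embedder; its cutoff frequency expands monotonically with diffusion time and reduces to the identity map at the data endpoint. The denoiser therefore no longer has to learn the band boundary internally, which eases the capacity-allocation problem. On ImageNet-256 with JiT-700M/32, the method consistently improves FID and Inception Score across training epochs; the gains are pronounced under coarse patch tokenization, and the method remains competitive under finer tokenization. Inserting the operator directly into the unified text-to-image model SenseNova-U1 likewise improves DPG-Bench and GenEval, indicating that the input-side spectral prior transfers beyond class-conditional generation.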

## Main Text

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour k^{*}(t) = (1-t)^{-2/α} separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time t. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, Spectral Forcing remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
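To make the operator described above concrete, here is a minimal NumPy sketch of a time-conditional 2D-DCT low-pass filter. It is an illustration under stated assumptions, not the paper's implementation: the function names (`dct_matrix`, `cutoff`, `spectral_forcing`), the radial mask over DCT indices, the scale `k0`, and the default `alpha=2.0` are all hypothetical. The cutoff reuses the contour k^{*}(t) = (1-t)^{-2/α} from the abstract, with t = 1 treated as the data endpoint so that the operator becomes the identity there.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C of size n x n, so that C.T @ C = I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # spatial index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)    # rescale the DC row for orthonormality
    return c


def cutoff(t, alpha=2.0, k0=1.0):
    """Assumed cutoff schedule k0 * (1 - t)^(-2/alpha); infinite at t = 1."""
    if t >= 1.0:
        return np.inf
    return k0 * (1.0 - t) ** (-2.0 / alpha)


def spectral_forcing(x, t, alpha=2.0, k0=1.0):
    """Low-pass the last two axes of x in the DCT domain at time t.

    x has shape (..., H, W). The radial mask below is an assumption;
    the source does not specify the mask shape.
    """
    h, w = x.shape[-2:]
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ x @ cw.T                 # forward 2D DCT
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    mask = np.sqrt(u**2 + v**2) <= cutoff(t, alpha, k0)
    return ch.T @ (coeffs * mask) @ cw     # inverse 2D DCT of kept bands


# Usage: filter a noisy input before it would reach the patch embedder.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((3, 32, 32))   # stand-in for a noisy image
early = spectral_forcing(noisy, t=0.2)     # narrow band, mostly low frequencies
late = spectral_forcing(noisy, t=1.0)      # identity at the data endpoint
```

Because the operator has no learned parameters, it can be placed in front of an existing denoiser without changing the network itself; only the choice of schedule and mask shape (both assumed here) would need to match the actual method.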
