DiffusionBench:扩散Transformer的整体评估基准
阅读原文· arxiv.org当前扩散Transformer(DiT)研究集中于ImageNet类别条件生成单一评估设置,方法排名与文生图(T2I)任务间无强相关。NanoGen框架统一了DiT训练与评估:在ImageNet上匹配SOTA基线,仅需修改12行配置即可训练T2I模型,两种任务训练计算量相当。基于NanoGen训练21个潜在扩散模型后,三个指标上ImageNet与T2I排名间的Pearson相关系数为-0.377至-0.580,表明仅靠ImageNet FID改进未必反映T2I真实进步。为此整合ImageNet与T2I结果形成DiffusionBench,作为替代单一ImageNet评估的DiT整体基准。
Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflect real progress in generative modeling. The natural alternative, i.e., text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I setups. Under NanoGen, training T2I requires comparable compute to ImageNet. After training 21 latent diffusion models with NanoGen, we observe that method ranking shows no strong correlation between ImageNet and T2I generation: Pearson correlation is between -0.377 and -0.580 across three metrics. This suggests that a method which improves class-conditional ImageNet FID may show no corresponding improvement on T2I, clearly indicating the necessity of evaluating DiTs on both tasks. To this end, we summarize ImageNet and text-to-image results, which yields DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.