# JLT: Clean-Latent Prediction in Latent Diffusion Transformers

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpo7oc4t03o4slv4oph8i6jn
- Original link: https://arxiv.org/abs/2605.27102

## AI Summary

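This paper introduces JLT, a 130M-parameter latent diffusion Transformer built on frozen FLUX.2 VAE codes. The study compares clean-latent prediction against a velocity-prediction DiT under the same representation and training settings. The analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance directions, whereas clean prediction suppresses them. On ImageNet 256x256, JLT-B/1 reaches an FID-50K of 2.50 with classifier-free guidance, a marked advantage over velocity prediction. The study concludes that the prediction target in latent diffusion is a representation-dependent geometric choice, not an interchangeable algebraic parameterization.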

## Main Text

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
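
The abstract's remark that x, epsilon, and v are linearly convertible at a fixed corruption time can be sketched numerically. The block below is a toy illustration, not the paper's code. It assumes the linear path `z_t = (1 - t) * x + t * eps` with velocity `v = eps - x`, which is one common flow-matching convention and may differ from the one JLT uses. The helper names and the per-direction variances in `lam` are hypothetical. The last two lines also show, under the added assumption that `x` and `eps` are independent, why a velocity target carries a noise-variance floor in every direction while a clean target does not.

```python
import numpy as np


def corrupt(x, eps, t):
    """Assumed linear corruption path: z_t = (1 - t) * x + t * eps."""
    return (1.0 - t) * x + t * eps


def clean_from_velocity(z_t, v, t):
    """Recover x from z_t and v = eps - x: x = z_t - t * v."""
    return z_t - t * v


def noise_from_velocity(z_t, v, t):
    """Recover eps from z_t and v: eps = z_t + (1 - t) * v."""
    return z_t + (1.0 - t) * v


def velocity_from_clean(z_t, x, t):
    """Recover v from z_t and x: v = (z_t - x) / t (requires t > 0)."""
    return (z_t - x) / t


rng = np.random.default_rng(0)

# Hypothetical latent with strongly anisotropic per-direction variances.
lam = np.array([4.0, 1.0, 0.01])
n = 200_000
x = rng.standard_normal((n, lam.size)) * np.sqrt(lam)
eps = rng.standard_normal((n, lam.size))  # unit-variance isotropic noise

t = 0.3
z_t = corrupt(x, eps, t)
v = eps - x

# Round trips: each variable is a linear function of (z_t, any other one).
x_back = clean_from_velocity(z_t, v, t)
eps_back = noise_from_velocity(z_t, v, t)
v_back = velocity_from_clean(z_t, x, t)

# Marginal target variances per direction.
var_x = x.var(axis=0)  # close to lam
var_v = v.var(axis=0)  # close to lam + 1: the noise adds a floor of 1
```

In this toy setup the lowest-variance direction has a clean-target variance near 0.01 but a velocity-target variance near 1.01, so the velocity target in that direction is almost entirely noise. This only illustrates the covariance floor the abstract mentions; it does not reproduce the paper's local Gaussian analysis.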