# PiD: Fast High-Resolution Latent Decoding via Pixel Diffusion

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-05-22 08:00
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmpkktlor07w4sl010b2qddwf
- Original paper: https://arxiv.org/abs/2605.23902

## AI Summary

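PiD is a decoder that reformulates latent decoding as conditional pixel diffusion, unifying image decoding and upsampling. By denoising directly in high-resolution pixel space, it supports 4× and 8× upsampling with low latency. A lightweight sigma-aware adapter injects noise-corrupted latents into the model, which allows the latent diffusion process to terminate early, and DMD2 distillation reduces inference to 4 steps. PiD is compatible with both conventional VAE latents and semantic latents. On an RTX 5090, it decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory; on a GB200 GPU it takes as little as 210 ms.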

## Main Text

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
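
The control flow described above (condition on a latent, denoise in pixel space at the target resolution, stop after 4 steps) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the pixel diffusion backbone, `upsample_nearest` replaces whatever learned conditioning the sigma-aware adapter performs, the noise schedule `sigmas` is arbitrary, and the "predict a clean image, then re-noise to the next level" loop is one common way to sample from few-step distilled generators rather than a confirmed detail of PiD.

```python
import random


def upsample_nearest(latent, scale):
    """Repeat every value `scale` times along both axes.

    Assumption: a crude stand-in for how a latent grid could be aligned
    with the higher-resolution pixel grid before conditioning.
    """
    out = []
    for row in latent:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def toy_denoiser(x_t, sigma, cond, cond_sigma):
    """Hypothetical stand-in for the pixel diffusion backbone.

    A real model would be a neural network that uses x_t, sigma and the
    latent noise level cond_sigma; this toy ignores them and returns a
    copy of the condition as its clean-image estimate.
    """
    return [list(row) for row in cond]


def decode(latent, scale=4, sigmas=(1.0, 0.75, 0.5, 0.25),
           cond_sigma=0.0, denoiser=toy_denoiser, seed=0):
    """Few-step conditional decoding loop (illustrative only)."""
    rng = random.Random(seed)
    cond = upsample_nearest(latent, scale)
    height, width = len(cond), len(cond[0])
    # Start from pure Gaussian noise at the output (pixel) resolution.
    x = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    for i, sigma in enumerate(sigmas):
        x0 = denoiser(x, sigma, cond, cond_sigma)  # clean-image estimate
        if i + 1 == len(sigmas):
            return x0  # last step: keep the estimate as the decoded image
        nxt = sigmas[i + 1]
        # Re-noise the estimate to the next, lower noise level.
        x = [[(1.0 - nxt) * x0[r][c] + nxt * rng.gauss(0.0, 1.0)
              for c in range(width)] for r in range(height)]
    return x


pixels = decode([[0.1, 0.2], [0.3, 0.4]], scale=4)
print(len(pixels), len(pixels[0]))
```

In this sketch a 2×2 grid with `scale=4` yields an 8×8 output after four denoiser calls. Passing a nonzero `cond_sigma` would be the hook for telling the model that the latent itself is only partially denoised, which is the situation the abstract describes when the latent diffusion process is terminated early.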