# DecQ: Detail-Condensing Queries for Improving Reconstruction and Generation Quality in Representation Autoencoders

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-21 08:00
- AIHOT score: 58
- AIHOT link: https://aihot.virxact.com/items/cmpgrlwgu0i1bsljwf75t8h2c
- Original link: https://arxiv.org/abs/2605.22777

## AI Summary

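Representation autoencoders (RAEs) use a frozen vision model as the encoder, which enables high-quality generation but limits spatial reconstruction capacity. Fine-tuning improves reconstruction but harms generation quality; to resolve this trade-off, the paper proposes the DecQ framework. DecQ introduces lightweight "detail-condensing query" modules that extract fine-grained information from the intermediate layers of the vision model and feed it into both the decoder and the generation process. Experiments show that, with only 8 additional queries and 3.9% extra computation, DecQ raises the PSNR of the DINOv2-based RAE from 19.13 dB to 22.76 dB. On generation, it converges 3.3× faster than the original framework and reaches FID scores of 1.41 without guidance and 1.05 with guidance, balancing reconstruction and generation performance.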
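To put the PSNR figures above in perspective, the short calculation below converts the reported gain into an equivalent reduction in mean squared error. It is a sketch that assumes the standard single-image definition, PSNR = 10 · log10(MAX² / MSE), with pixel values scaled to [0, 1]; the helper names are hypothetical, and because benchmark PSNR is usually averaged over many images, the ratio is only indicative.

```python
import math


def psnr_from_mse(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_from_psnr(psnr_db: float, max_val: float = 1.0) -> float:
    """Invert the PSNR definition to recover the mean squared error."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))


# PSNR values reported for the frozen DINOv2-based RAE and for DecQ.
baseline_db, decq_db = 19.13, 22.76

# Ratio of baseline MSE to DecQ MSE implied by the dB gain
# (assumes both numbers use the same peak value).
mse_ratio = mse_from_psnr(baseline_db) / mse_from_psnr(decq_db)
print(f"PSNR gain: {decq_db - baseline_db:.2f} dB")
print(f"Implied MSE reduction factor: {mse_ratio:.2f}")
```

Under these assumptions, the 3.63 dB gain corresponds to cutting the mean squared error by a factor of roughly 2.3.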

## Full Text

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
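The abstract does not specify how the condenser modules are built. One plausible reading is that a small set of learned queries cross-attends to features from several encoder layers and is then appended to the patch tokens. The NumPy sketch below illustrates that reading only: the single-head attention, the residual update, the random weights, the dimensions, and every name in it are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def condense(queries, features, w_q, w_k, w_v):
    """Hypothetical condenser: single-head cross-attention.

    queries:  (num_queries, dim) detail-condensing queries
    features: (num_patches, dim) features from one encoder layer
    Returns the updated queries and the attention map.
    """
    q = queries @ w_q
    k = features @ w_k
    v = features @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return queries + attn @ v, attn  # residual update of the queries


rng = np.random.default_rng(0)
dim, num_patches, num_queries = 64, 256, 8  # 8 queries, as in the abstract

# Random stand-ins for learned queries and for frozen encoder features
# taken from a shallow, a middle, and a deep layer (assumed choice).
queries = rng.normal(size=(num_queries, dim)) * 0.02
layer_feats = [rng.normal(size=(num_patches, dim)) for _ in range(3)]

for feats in layer_feats:
    # One set of (untrained) projection weights per layer.
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    queries, attn = condense(queries, feats, w_q, w_k, w_v)

# The decoder would see the final-layer patch tokens plus the queries.
patch_tokens = layer_feats[-1]
decoder_input = np.concatenate([patch_tokens, queries], axis=0)
print(decoder_input.shape)  # (264, 64)
```

In this sketch the token sequence grows from 256 to 264, about 3% longer. That is an artifact of the assumed patch count and is not meant to reproduce the 3.9% compute overhead reported above.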