Injecting Visual Concepts: Image Guidance for Text-Conditioned Diffusion Models at Inference Time
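Source: arxiv.org
Visual Concept Fusion (VCF) is the first method that accepts both an image and a text prompt at inference time without any concept-specific training. It injects visual concepts by aligning CLIP image features with the text embedding space. VCF comprises a lightweight aligner, a fusion strategy, and an optional Prompt-Noise Optimization (PNO) module. Experiments show that VCF transfers visual attributes such as style, composition, and color palette from a reference image while still following the text prompt. Quantitative results reveal a trade-off between text alignment (CLIP score) and visual similarity (LPIPS), with VCF outperforming baseline methods in reference fidelity.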
Text-to-image diffusion models such as Stable Diffusion generate high-quality images from text, but they lack a way to inject visual guidance (e.g., sketches or styles) at inference time without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and a text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes, including style, composition, and color palette, from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
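The abstract names InfoNCE as one of the aligner's training losses but does not give its exact form. The following is a minimal sketch, assuming the standard image-to-text InfoNCE formulation over a batch of paired embeddings; it is not the authors' implementation. The function names `l2_normalize` and `info_nce`, the temperature value 0.07, and the use of plain Python lists in place of real CLIP features are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Mean image-to-text InfoNCE loss (hypothetical helper, not from the paper).

    image_embs[i] and text_embs[i] are treated as a positive pair; every other
    text embedding in the batch serves as a negative for image i.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Temperature-scaled cosine similarity between image i and every text.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching text as the target class.
        total += log_denom - logits[i]
    return total / len(imgs)


# Toy 2-D embeddings (assumed values, for illustration only).
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each image embedding points in the same direction as its paired text embedding, and it grows when the pairs are swapped. That is the signal that would push an aligner's outputs toward the text embedding space.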