FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representations
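Source: arxiv.org
FLUX3D proposes an image-to-3D Gaussian Splatting (3DGS) generation framework that addresses two structural bottlenecks: a representation bottleneck (sparse voxel latents built from discriminative 2D features suppress reconstructive cues) and a cross-modal correspondence bottleneck (standard diffusion transformers struggle to align dense 2D tokens with sparse 3D tokens). It introduces Diffusion-Aligned Structured Latents (DA-SLAT) with a decoder-only architecture to improve 3DGS fidelity, and designs a sparse-structure-aware diffusion framework comprising the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic alignment. Experiments show that FLUX3D significantly surpasses existing SOTA methods in appearance fidelity.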
Sparse voxel representations have emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve the high-frequency visual details of input images because of two structural bottlenecks. First, they construct sparse voxel latents from discriminative 2D features optimized for semantic abstraction, which suppresses reconstructive cues and induces a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that improves both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning and propose Diffusion-Aligned Structured Latents (DA-SLAT), which we couple with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework that integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms existing state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.
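To make the notion of sparse voxel latents concrete, the following is a minimal illustrative sketch of storing only the occupied voxels of a grid together with one latent vector each. It is not FLUX3D's implementation: the names `SparseVoxelLatents`, `from_dense`, and `occupancy_ratio`, the toy 4×4×4 grid, and the 2-dimensional latents are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

Coord = tuple[int, int, int]


@dataclass
class SparseVoxelLatents:
    """Hypothetical container: coordinates of occupied voxels plus one latent each."""
    resolution: int
    coords: list[Coord] = field(default_factory=list)
    feats: list[list[float]] = field(default_factory=list)

    def occupancy_ratio(self) -> float:
        # Fraction of the full resolution^3 grid that is actually stored.
        return len(self.coords) / self.resolution ** 3


def from_dense(dense) -> SparseVoxelLatents:
    """Convert a dense R x R x R grid (None marks an empty voxel) to sparse form."""
    sparse = SparseVoxelLatents(resolution=len(dense))
    for x, plane in enumerate(dense):
        for y, row in enumerate(plane):
            for z, latent in enumerate(row):
                if latent is not None:
                    sparse.coords.append((x, y, z))
                    sparse.feats.append(list(latent))
    return sparse


# Toy example (assumed values): a 4 x 4 x 4 grid with two occupied voxels.
R = 4
dense = [[[None] * R for _ in range(R)] for _ in range(R)]
dense[0][1][2] = [0.5, -1.0]
dense[3][3][0] = [1.5, 2.0]
latents = from_dense(dense)
```

In this toy grid only 2 of 64 voxels carry a latent, which illustrates why the 3D token sequence is sparse and irregular while the 2D image tokens it must be aligned with form a dense regular grid.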