迈向原生多模态建模:一份路线图
阅读原文· arxiv.org本文提出了从多模态无关推理迈向世界建模的路径,聚焦从后期融合范式转向原生多模态建模(NMM)。研究正式定义了架构的原生性,将中期融合与早期融合从非原生范式中区分,并依据输入输出对偶性将现有原生模型分为三类:用于跨模态理解的“多模态输入至文本输出”、面向特定场景生成的“多模态输入至目标输出”,以及统一建模的“多模态输入至多模态输出”。文章系统性地探讨了向最终原生多模态建模框架的工业级转型路径,涵盖架构协调、大规模数据构建、全栈训练方案、推理部署及综合评估。
Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late-fusion that assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM) with the intrinsic integration of modalities for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define the architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio and video generation, and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from industrial perspectives from architectural coordination, massive data curation, to full-stack training recipes, inference & deployment, and the comprehensive evaluation for truly native modeling.