从像素到词语--迈向规模化原生One-Vision模型
阅读原文· arxiv.org提出一种名为NEO-ov的原生视觉语言基础模型,它能够端到端地学习跨帧和像素-词语的对应关系,无需任何外部图像编码器、辅助适配器或后处理融合。该架构完全消除了模块边界,使得精细、统一的时空建模能力在模型内部原生涌现。研究表明,NEO-ov在精细视觉感知任务上表现优异,大幅缩小了与模块化模型的性能差距,验证了原生One-Vision架构在规模化下的可行性。代码与模型已开源。
Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.