OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanations
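To address the high latency of Chain-of-Thought reasoning in autonomous driving, this paper proposes OneVL, a framework that unifies VLA and a world model. Compact latent tokens are supervised by dual auxiliary decoders (a language decoder that reconstructs the text reasoning chain and a visual world model decoder that predicts future frames), so that the model internalizes the causal regularities of road geometry and dynamic environments. A three-stage training pipeline progressively aligns trajectory, language, and visual objectives; at inference, the decoders are discarded, enabling single-step parallel computation. Across four benchmarks, OneVL is the first latent reasoning method to surpass explicit CoT, reaching SOTA accuracy at answer-only latency.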
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
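To make the latency argument concrete, the following is a minimal sketch that counts the forward passes that must run one after another under each decoding scheme. It is not the authors' code: `DecodePlan`, `sequential_passes`, the mode names, and all token counts are hypothetical, and the model assumes that the prefill pass also emits the first generated token while every later generated token costs one further pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodePlan:
    """Token counts for one planning query (all values are illustrative)."""
    cot_tokens: int     # explicit reasoning tokens, generated one by one
    latent_tokens: int  # compact latent tokens, placed in the prefill
    answer_tokens: int  # trajectory tokens, generated one by one


def sequential_passes(plan: DecodePlan, mode: str) -> int:
    """Count forward passes that cannot be run in parallel.

    Simplifying assumption: the prefill pass emits the first generated
    token, and each later generated token needs one more pass.
    """
    if mode == "explicit_cot":
        generated = plan.cot_tokens + plan.answer_tokens
    elif mode in ("answer_only", "latent_cot"):
        # Latent tokens are prefilled together with the prompt, so
        # plan.latent_tokens widens the prefill but adds no sequential steps.
        generated = plan.answer_tokens
    else:
        raise ValueError(f"unknown mode: {mode}")
    return max(1, generated)


# Hypothetical sizes, chosen only to show the shape of the comparison.
plan = DecodePlan(cot_tokens=120, latent_tokens=8, answer_tokens=20)
for mode in ("explicit_cot", "answer_only", "latent_cot"):
    print(mode, sequential_passes(plan, mode))
```

Under these assumptions the latent scheme needs the same number of sequential passes as answer-only prediction, which is the sense in which the abstract describes the two as matching in speed. The wider prefill still costs some compute, but that work is parallel rather than sequential.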