Vesta:通用具身推理模型
阅读原文· arxiv.orgVesta是一个统一的具身通用基础模型,将定位、空间推理、导航和长期规划能力整合于单一模型。通过大规模空间感知数据集和简单的多模态记忆机制,Vesta在多种基准测试中平均超过单个SOTA基线20%以上,并优于按类别最佳基线集成的结果10%以上。在需要记忆与推理的真实机器人任务中,Vesta将任务成功率提升35%以上,表明单一通用模型在可行性和可扩展性上优于多模型组合方案。
Robots operating in open-world environments must seamlessly integrate localization, spatial reasoning, navigation, and long-horizon planning. While specialist models excel at individual tasks, deploying a multi-model stack is computationally expensive and prone to cascading errors. We present Vesta, a unified embodied generalist that consolidates these capabilities into a single foundation model. Our approach combines a diverse and massive curated corpus designed to induce spatial grounding and a simple multimodal memory harness that enables reasoning over extended time horizons. Across diverse benchmarks, Vesta on average beats individual SOTA baselines by >20% and beats an ensemble of per-category-best baselines by >10% -- thus demonstrating that a generalist model can match or exceed specialists. On real-world robotic tasks requiring memory and reasoning, Vesta improves task success by >35\%. Our work thus demonstrates that a single generalist is a feasible, scalable, and arguably preferable alternative to combining specialists.