Guava: An Efficient, General Framework for Embodied Manipulation
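Source: arxiv.org
Guava is a framework for embodied tool use. Through a systematic exploration of agent workflows, action spaces, and observation spaces, it identifies three key design choices: an iterative perception-reasoning-action loop, semantic action abstractions, and multimodal observations. The authors also develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K simulated trajectories. Simulation and real-world experiments show that Guava performs close to frontier proprietary models and generalizes strongly to unseen objects, novel instructions, and long-horizon tasks. The results suggest that a carefully designed framework can serve as a model-agnostic interface for embodied manipulation, bringing emergent capabilities to compact open-source models with very little data.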
Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing these models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities across a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles hold even for small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models, along with strong generalization to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
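To make the three ingredients concrete, the following is a minimal Python sketch of an iterative perception-reasoning-action loop driving semantic actions from a multimodal observation. It is an illustration only, not Guava's implementation: `ToyEnv`, `Observation`, `Action`, `scripted_policy`, and `run_episode` are hypothetical names, the environment is a toy tabletop world, and a hand-written rule stands in for the reasoning model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Observation:
    """Multimodal observation (assumed layout): raw image bytes plus symbolic scene state."""
    image: bytes
    on_table: Tuple[str, ...]
    held: Optional[str]


@dataclass
class Action:
    """Semantic action: a named skill with symbolic arguments, not joint-level commands."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


class ToyEnv:
    """Hypothetical tabletop world: objects sit on the table, in the gripper, or in a bin."""

    def __init__(self, objects: List[str]) -> None:
        self.on_table: List[str] = list(objects)
        self.in_bin: List[str] = []
        self.held: Optional[str] = None

    def observe(self) -> Observation:
        # The image is empty here; a real system would attach a camera frame.
        return Observation(image=b"", on_table=tuple(self.on_table), held=self.held)

    def step(self, action: Action) -> None:
        obj = action.args.get("object")
        if action.name == "pick" and self.held is None and obj in self.on_table:
            self.on_table.remove(obj)
            self.held = obj
        elif action.name == "place" and self.held is not None:
            self.in_bin.append(self.held)
            self.held = None
        # Any other action is treated as a no-op in this toy world.


def scripted_policy(instruction: str, obs: Observation) -> Optional[Action]:
    """Stand-in for the reasoning model: chooses the next semantic action, or None when done."""
    if obs.held is not None:
        return Action("place", {"target": "bin"})
    if obs.on_table:
        return Action("pick", {"object": obs.on_table[0]})
    return None


Policy = Callable[[str, Observation], Optional[Action]]


def run_episode(env: ToyEnv, policy: Policy, instruction: str, max_steps: int = 20) -> List[Action]:
    """Iterate perceive -> reason -> act until the policy stops or the step budget runs out."""
    trace: List[Action] = []
    for _ in range(max_steps):
        obs = env.observe()                  # perception
        action = policy(instruction, obs)    # reasoning
        if action is None:
            break
        env.step(action)                     # action
        trace.append(action)
    return trace


if __name__ == "__main__":
    env = ToyEnv(["apple", "cup"])
    actions = run_episode(env, scripted_policy, "put everything in the bin")
    print([a.name for a in actions], env.in_bin)
```

Because the policy is re-queried after every step with a fresh observation, it can react to the outcome of the previous action instead of committing to a full plan up front. Swapping `scripted_policy` for a call to a language model would leave the loop unchanged, which is the sense in which such a harness is model-agnostic.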