VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

2026-06-05 08:00

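AI summary

VoLoAgent is a VLM-based physical orchestration agent that treats heterogeneous robot capabilities (VLA/WAM, vision models, action primitives) as interruptible tools in order to plan, monitor, and recover. The work also introduces RoboVoLo, a benchmark designed for open-vocabulary long-horizon manipulation that covers common sense, memory/state tracking, complex references, and world knowledge, and that reports both task-level success rates and failure diagnostics. Experiments show that VoLoAgent substantially outperforms single VLA/VLM and tool-based systems, with validation on a real robot.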

Original abstract

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/
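The closed loop described in the abstract can be illustrated with a minimal sketch. It is not VoLoAgent's implementation: the names `Observation`, `rollout_policy`, and `orchestrate` are hypothetical, the "policy" is a stub generator standing in for a VLA/WAM rollout, and the monitor is a plain callback standing in for the VLM. The sketch only shows the control pattern of interrupting a tool mid-rollout and retrying.

```python
from dataclasses import dataclass
from typing import Callable, Generator, List, Optional


@dataclass
class Observation:
    step: int       # control step within the current rollout
    anomaly: bool   # hypothetical flag, e.g. "the object slipped"


def rollout_policy(
    subgoal: str, fail_at: Optional[int] = None, horizon: int = 5
) -> Generator[Observation, None, None]:
    """Stub for a VLA/WAM rollout: yields one observation per control step."""
    for t in range(1, horizon + 1):
        yield Observation(step=t, anomaly=(t == fail_at))


def orchestrate(
    plan: List[str],
    monitor: Callable[[Observation], bool],
    make_rollout: Callable[[str, int], Generator[Observation, None, None]],
    max_retries: int = 1,
) -> List[str]:
    """Run each subgoal, interrupting and retrying when the monitor fires."""
    log: List[str] = []
    for subgoal in plan:
        attempts = 0
        while True:
            rollout = make_rollout(subgoal, attempts)
            interrupted = False
            for obs in rollout:
                if monitor(obs):
                    rollout.close()  # stop the tool mid-rollout
                    interrupted = True
                    log.append(f"interrupt:{subgoal}@{obs.step}")
                    break
            if not interrupted:
                log.append(f"done:{subgoal}")
                break
            attempts += 1
            if attempts > max_retries:
                log.append(f"abort:{subgoal}")
                return log
            log.append(f"retry:{subgoal}")
    return log


def demo() -> List[str]:
    # Assumed scenario: the first attempt at "pick cup" fails at step 2.
    def make_rollout(subgoal: str, attempt: int):
        fail_at = 2 if (subgoal == "pick cup" and attempt == 0) else None
        return rollout_policy(subgoal, fail_at=fail_at)

    return orchestrate(["pick cup", "place cup"], lambda o: o.anomaly, make_rollout)


if __name__ == "__main__":
    print(demo())
```

In this toy setup the monitor checks every step rather than waiting for the rollout to finish, which mirrors the abstract's point that the physical world does not pause for reasoning. A real system would also have to account for the latency of the monitoring model itself, which this sketch ignores.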

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
