# VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-05 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmq7cy4rn01yxsl5wqh3ta04c
- Original link: https://arxiv.org/abs/2606.07723

## AI Summary

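VoLoAgent is a VLM-based physical orchestration agent that treats heterogeneous robot capabilities (VLA/WAM, vision models, action primitives) as interruptible tools for planning, monitoring, and recovery. The work also introduces RoboVoLo, a benchmark designed for open-vocabulary long-horizon manipulation that covers common sense, memory/state tracking, complex references, and world knowledge, and that reports both task-level success rates and failure diagnostics. Experiments show that VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, and the approach is validated on a real robot.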

## Main Text

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/
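To make the closed loop described above more concrete, here is a minimal Python sketch of a plan, monitor, interrupt, and retry cycle. It is not VoLoAgent's implementation: every name (`Tool`, `Orchestrator`, `make_flaky_policy`, `looks_failed`) is hypothetical, the "VLM" is replaced by string splitting and keyword matching, and the "VLA/WAM" is a fake generator that emits one text observation per control step. The only point it illustrates is that a tool modeled as a generator can be stopped mid-rollout and re-invoked.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Generator, List, Tuple


@dataclass
class Tool:
    """Hypothetical interruptible tool: run() yields one observation per control step."""
    name: str
    run: Callable[[str], Generator[str, None, None]]


@dataclass
class Orchestrator:
    """Toy stand-in for the orchestrating VLM: plans, monitors each step, retries on failure."""
    tools: Dict[str, Tool]
    max_retries: int = 2
    log: List[str] = field(default_factory=list)

    def plan(self, instruction: str) -> List[Tuple[str, str]]:
        # Assumption: a real system would query a VLM; here we split on " then ".
        return [("policy", sub.strip()) for sub in instruction.split(" then ")]

    def looks_failed(self, observation: str) -> bool:
        # Assumption: a real monitor would inspect camera frames, not keywords.
        return "dropped" in observation

    def execute(self, instruction: str) -> bool:
        for tool_name, subgoal in self.plan(instruction):
            for _attempt in range(self.max_retries + 1):
                rollout = self.tools[tool_name].run(subgoal)
                interrupted = False
                for obs in rollout:
                    self.log.append(obs)
                    if self.looks_failed(obs):
                        rollout.close()  # stop the tool mid-rollout
                        interrupted = True
                        break
                if not interrupted:
                    break  # subgoal finished; move on to the next one
            else:
                return False  # every attempt was interrupted
        return True


def make_flaky_policy(fail_first_n: int) -> Callable[[str], Generator[str, None, None]]:
    """Build a fake policy that drops the object during its first `fail_first_n` rollouts."""
    calls = {"n": 0}

    def run(subgoal: str) -> Generator[str, None, None]:
        calls["n"] += 1
        fails = calls["n"] <= fail_first_n
        yield f"{subgoal}: reaching"
        yield f"{subgoal}: dropped" if fails else f"{subgoal}: grasped"
        yield f"{subgoal}: done"

    return run


agent = Orchestrator(tools={"policy": Tool("policy", make_flaky_policy(1))})
success = agent.execute("pick the mug then place it on the shelf")
```

In this toy run the first rollout is cut off at the "dropped" observation, so its "done" step never executes, and the second attempt completes the subgoal. A real system would also have to decide when to check, since the abstract stresses that the physical world does not pause while the orchestrator reasons; this sketch sidesteps that by checking synchronously after every step.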