# Retrieve, Don't Retrain: Extending VLA Models to New Tasks via Test-Time Retrieval

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-14 08:00
- AIHOT score: 45
- AIHOT link: https://aihot.virxact.com/items/cmqg30by7032aslspmtq24wzn
- Original link: https://arxiv.org/abs/2606.15631

## AI Summary

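The paper proposes a retrieval-augmented vision-language-action (VLA) policy that is trained once and then frozen. New tasks are handled by appending demonstration data to a retrieval pool, with no per-task fine-tuning. The effect is especially pronounced on Cosmos Policy, a video-generation-based world-action model (WAM), where retrieval supplies coarse task progression and the future-image objective adds a visual consistency signal. On PushT, retrieval acts as a reusable motion prior for unseen goal angles; on RoboTwin 2.0, the method outperforms cross-embodiment baselines on unseen tasks. It is also validated on a real robot.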

## Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
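
The deployment loop described above (append demonstrations to a pool, retrieve at every control step, keep the policy frozen) can be illustrated with a toy sketch. This is not the paper's implementation: the `Demo` and `RetrievalPool` classes, the cosine-similarity lookup, the hand-written embeddings, and the `frozen_policy` stand-in are all assumptions chosen only to show how a new task can be absorbed by indexing data rather than updating parameters.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Demo:
    """A pool-side demonstration (hypothetical structure)."""
    task: str
    embedding: List[float]          # assumed feature vector describing the demo
    trajectory: List[List[float]]   # coarse waypoints, one per control step


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class RetrievalPool:
    """Grows at deployment time; no model parameters are touched."""
    demos: List[Demo] = field(default_factory=list)

    def append(self, demo: Demo) -> None:
        self.demos.append(demo)

    def retrieve(self, query: List[float], k: int = 1) -> List[Demo]:
        # Rank all stored demos by similarity to the query embedding.
        ranked = sorted(self.demos,
                        key=lambda d: cosine(query, d.embedding),
                        reverse=True)
        return ranked[:k]


def frozen_policy(observation: List[float], retrieved: Demo, step: int) -> List[float]:
    """Stand-in for a frozen policy: step toward the retrieved waypoint.

    A real policy would be a learned network conditioned on the retrieved
    trajectory; this fixed rule only mimics that conditioning.
    """
    index = min(step, len(retrieved.trajectory) - 1)
    waypoint = retrieved.trajectory[index]
    return [w - o for w, o in zip(waypoint, observation)]


def control_step(pool: RetrievalPool, query: List[float],
                 observation: List[float], step: int) -> List[float]:
    """One control step: retrieve first, then act with the frozen policy."""
    best = pool.retrieve(query, k=1)[0]
    return frozen_policy(observation, best, step)
```

In this sketch, supporting a new task amounts to a single `pool.append(...)` call, after which `control_step` can retrieve the new demonstration without any change to `frozen_policy`.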
