# Zero-Shot Sim-to-Real VLA Enhancement with Object-Centric Residual RL

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-17 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqyw5b4q004yslhrjzojtxak
- Original paper: https://arxiv.org/abs/2606.18953

## AI Summary

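VLA models are brittle in precise physical interactions because execution errors inherited from imitation learning compound over a rollout. To address this, the paper proposes a residual reinforcement learning framework built on object poses. The method refines VLA actions using object poses, yielding a compact observation space that transfers consistently between simulation and reality. The residual RL policy is trained only in simulation (with pose noise injection and dropout) and transfers zero-shot to a real Franka Research 3 robot. Across five manipulation tasks, it raises the success rate from 42% to 76% zero-shot, and the improved trajectories can be reused to retrain the base VLA, enabling self-improvement without additional teleoperation.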
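For scale, the two headline figures can be restated as absolute and relative changes. These are derived here only from the reported averages of 42% and 76%; per-task numbers are not given in this summary.

$$
\begin{aligned}
\text{absolute gain} &= 76\% - 42\% = 34 \text{ percentage points},\\
\text{relative gain} &= \tfrac{76 - 42}{42} \approx 0.81,\\
\text{failure rate} &:\; 100\% - 42\% = 58\% \;\to\; 100\% - 76\% = 24\%, \qquad \tfrac{58 - 24}{58} \approx 0.59 .
\end{aligned}
$$

Put differently, the average failure rate falls by roughly 59% of its baseline value, assuming both numbers are measured over the same tasks and trial counts.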

## Main Text

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
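The residual scheme described above (a corrective policy added on top of a frozen VLA, fed object poses that are perturbed with noise and dropout during simulation training) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 7-D pose layout (xyz plus a unit quaternion), the noise scales, the zero-fill-plus-flag form of dropout, the scaling and clipping of the residual, the 7-D action layout, and the names `augment_pose`, `toy_residual_policy` and `compose_action` are all hypothetical. `toy_residual_policy` merely stands in for a trained network.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PoseObservation:
    """Hypothetical object-centric observation: xyz + unit quaternion (w, x, y, z)."""
    pose: List[float]
    visible: bool  # False when the pose was dropped out


def augment_pose(pose: Sequence[float], pos_sigma: float, rot_sigma: float,
                 dropout_p: float, rng: random.Random) -> PoseObservation:
    """Sim-only augmentation (assumed form): Gaussian pose noise plus whole-pose dropout."""
    if rng.random() < dropout_p:
        # Emulate a missing pose estimate: zero-fill the pose and flag it as invisible.
        return PoseObservation(pose=[0.0] * len(pose), visible=False)
    pos = [p + rng.gauss(0.0, pos_sigma) for p in pose[:3]]
    quat = [q + rng.gauss(0.0, rot_sigma) for q in pose[3:]]
    # Re-normalize so the perturbed orientation is still a unit quaternion.
    norm = math.sqrt(sum(q * q for q in quat)) or 1.0
    return PoseObservation(pose=pos + [q / norm for q in quat], visible=True)


def toy_residual_policy(obs: PoseObservation, vla_action: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the learned residual network."""
    if not obs.visible:
        return [0.0] * len(vla_action)  # no pose available: leave the VLA action unchanged
    # Illustrative only: a small correction on the first three (assumed translation) dims.
    correction = [-0.1 * x for x in obs.pose[:3]]
    return correction + [0.0] * (len(vla_action) - 3)


def compose_action(vla_action: Sequence[float], residual: Sequence[float],
                   scale: float, limit: float) -> List[float]:
    """Final action = frozen VLA action + scale * residual clipped to [-limit, limit]."""
    return [a + scale * max(-limit, min(limit, r)) for a, r in zip(vla_action, residual)]


if __name__ == "__main__":
    rng = random.Random(0)
    true_pose = [0.4, 0.0, 0.05, 1.0, 0.0, 0.0, 0.0]     # hypothetical object pose
    vla_action = [0.01, 0.0, -0.02, 0.0, 0.0, 0.0, 1.0]  # hypothetical 7-D base action
    obs = augment_pose(true_pose, pos_sigma=0.005, rot_sigma=0.01, dropout_p=0.1, rng=rng)
    action = compose_action(vla_action, toy_residual_policy(obs, vla_action),
                            scale=0.5, limit=0.02)
    print(obs.visible, [round(a, 4) for a in action])
```

Scaling and clipping the residual keeps the combined action close to what the frozen VLA proposed, which is one common way to stop a residual policy from overriding its base policy; the abstract does not say whether the paper bounds the residual this way. At deployment, `augment_pose` would be skipped and the real pose estimate passed in directly.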
