# HABC：面向稀疏回合结果的分层优势加权在线RL微调方法

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-15 08:00
- AIHOT 分数：49
- AIHOT 链接：https://aihot.virxact.com/items/cmqgdr5tw0024slmsim6i13b1
- 原文链接：https://arxiv.org/abs/2606.17043

## AI 摘要

针对预训练VLA策略在线RL微调中回合结果仅含单一成功/失败二元标签的问题，HABC提出分层优势加权方法。它分别训练生存性与效率两个critic head，通过状态自适应门控合并优势，优先保证生存性，仅在成功确定时转向效率，并将合并结果转化为每步权重作用于actor loss。干预感知信用分配进一步限制结果标签于当前策略自主执行片段。在三个接触丰富的双手真实机器人任务上，HABC将成功率从监督微调基线的36%、44%、12%分别提升至92%、88%、38%。

## 正文

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
