# AsyncWebRL：面向视觉Web智能体的高效多步强化学习

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-04 08:00
- AIHOT 分数：47
- AIHOT 链接：https://aihot.virxact.com/items/cmq7033yc017zslbhfffonr3p
- 原文链接：https://arxiv.org/abs/2606.05597

## AI 摘要

AsyncWebRL采用异步系统设计，重叠rollout、梯度更新与策略刷新，并引入永久rollout池和轻量截图处理，比此前最快开源同步流程WebGym实现最高2.9倍端到端训练吞吐加速。算法方面将多步GRPO中每轨迹归一化因子1/|τ_i|替换为常数1/k，解除了失败轨迹对梯度权重的耦合，压缩轨迹长度。在WebGym分布外测试集上创下新开源SOTA（相对+5.8%），Medium子集+42%，Hard子集+48%。

## 正文

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a 2.9times end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
