# PaW: A Policy and World Model Co-Training Framework

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-01 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpwt12lv00fpsl7945f1rdyk
- Original paper: https://arxiv.org/abs/2606.02388

## AI Summary

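The paper proposes PaW, a framework that co-trains a policy and a world model to improve language-agent performance. It directly reuses a signal already present in on-policy reinforcement learning rollouts (each action paired with the observation that follows it), so it needs no additional simulator, training stage, or inference-time computation. PaW introduces three components to keep the auxiliary supervision stable: action-entropy-based world-model data selection, a noise-tolerant loss function, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show that PaW consistently outperforms strong RL baselines across different models and RL algorithms, supporting the claim that standard RL rollouts are a practical source of world-model supervision.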

## Main Text

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
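To make the three components more concrete, here is a minimal Python sketch of how rollout transitions could be turned into auxiliary WM supervision. It is not the paper's implementation: the abstract does not give formulas, so every name (`Transition`, `select_by_entropy`, `gce_loss`, `reward_adaptive_weight`, `total_loss`) is hypothetical. The choice to keep high-entropy actions, the use of generalized cross-entropy as a stand-in noise-tolerant loss, and the rule that shrinks the WM weight as mean reward rises are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    observation: str       # what the agent saw before acting
    action: str            # the action text the policy emitted
    next_observation: str  # what the environment returned afterwards
    action_entropy: float  # policy entropy for this action (assumed to be logged)


def select_by_entropy(rollout: List[Transition], keep_fraction: float = 0.5) -> List[Transition]:
    """Keep the highest-entropy transitions (ASSUMPTION: the selection direction is a guess)."""
    k = max(1, math.ceil(len(rollout) * keep_fraction))
    return sorted(rollout, key=lambda t: t.action_entropy, reverse=True)[:k]


def build_wm_examples(transitions: List[Transition]) -> List[Tuple[str, str]]:
    """Pair (observation + action) with the next observation as the WM prediction target."""
    return [(f"{t.observation}\n{t.action}", t.next_observation) for t in transitions]


def gce_loss(target_prob: float, q: float = 0.7) -> float:
    """Generalized cross-entropy, a known noise-robust loss used here as a stand-in.

    It is bounded by 1/q and approaches -log(p) as q -> 0.
    """
    return (1.0 - target_prob ** q) / q


def reward_adaptive_weight(mean_reward: float, base: float = 0.1) -> float:
    """ASSUMPTION: down-weight the WM loss as the mean reward (clipped to [0, 1]) rises."""
    clipped = min(1.0, max(0.0, mean_reward))
    return base * (1.0 - clipped)


def total_loss(policy_loss: float, wm_losses: List[float], mean_reward: float) -> float:
    """Combine the RL policy loss with the weighted mean auxiliary WM loss."""
    wm_loss = sum(wm_losses) / len(wm_losses) if wm_losses else 0.0
    return policy_loss + reward_adaptive_weight(mean_reward) * wm_loss


# Usage: a toy two-step rollout with made-up observations and entropies.
rollout = [
    Transition("kitchen", "open fridge", "the fridge is open", 1.2),
    Transition("fridge open", "take apple", "you hold an apple", 0.3),
]
selected = select_by_entropy(rollout, keep_fraction=0.5)
examples = build_wm_examples(selected)
```

Because the WM examples are built from transitions the policy already generated during RL, the sketch needs no separate simulator and leaves inference unchanged, which is the property the abstract emphasizes. In a real training loop the scalar losses would be tensors computed from the model's token probabilities.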
