TRIAGE:智能体强化学习的角色类型化信用分配框架
阅读原文· arxiv.orgTRIAGE 提出角色类型化信用分配框架,替代标准 GRPO 的均匀优势信号。结构化判断器将每个智能体片段分类为决定性进展、有用探索、无进展基础设施或回归,并映射为固定角色条件规则下的过程奖励,修正纯结果信用对失败轨迹中有用探索的惩罚和对成功轨迹中冗余/倒退动作的强化。在 ALFWorld、Search-QA 和 WebShop 上,TRIAGE 提升成功率,优于标量判断器过程奖励和结果监督共享主干价值基线。消融实验表明收益来自角色类型化,成功轨迹内的回归检测是主要贡献,探索信用提供二次增益;在完整轨迹上,TRIAGE 分别减少 10.4% 和 14.8% 的环境交互轮数。
Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone -- a projection of the per-segment advantage residual onto the role variable -- so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO.