TRIAGE: A Role-Typed Credit Assignment Framework for Agentic Reinforcement Learning

2026-06-30 08:00

AI Summary

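TRIAGE is a role-typed credit assignment framework that adds a semantic role axis to the uniform outcome advantage used by standard GRPO. A structured judge classifies each agent segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This corrects two blind spots of outcome-only credit: penalizing useful exploration in failed trajectories and reinforcing redundant or regressive actions in successful ones. On ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than from dense rewards alone: regression detection inside successful trajectories is the dominant contributor, while exploration credit provides a secondary gain. On completed ALFWorld and WebShop rollouts, TRIAGE reduces environment-facing turns by 10.4% and 14.8%, respectively, relative to GRPO.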

Original abstract · untranslated

Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone -- a projection of the per-segment advantage residual onto the role variable -- so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO.
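The role-conditioned rule described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the constants in `ROLE_BONUS`, the function names, the use of a population standard deviation for group normalization, and the choice to add the role term to the GRPO advantage. The abstract states only that the mapping is fixed, role-conditioned, and bounded, and that verifier outcomes remain the source of optimization direction.

```python
from statistics import mean, pstdev

# Hypothetical role constants. The abstract says the rule is fixed and
# bounded but gives no values; these numbers are illustrative only.
ROLE_BONUS = {
    "decisive_progress": 0.5,
    "useful_exploration": 0.2,
    "no_progress_infrastructure": 0.0,
    "regression": -0.5,
}


def grpo_advantages(outcomes, eps=1e-6):
    """Group-normalized outcome advantage: one scalar per rollout."""
    mu = mean(outcomes)
    sigma = pstdev(outcomes)
    return [(r - mu) / (sigma + eps) for r in outcomes]


def role_typed_advantages(outcomes, rollout_roles):
    """Per-segment advantage = outcome advantage + fixed role constant.

    `rollout_roles[i]` is the list of judge labels for rollout i.
    Additive combination is an assumption of this sketch.
    """
    base = grpo_advantages(outcomes)
    return [
        [adv + ROLE_BONUS[role] for role in roles]
        for adv, roles in zip(base, rollout_roles)
    ]


if __name__ == "__main__":
    outcomes = [1.0, 0.0]  # one successful and one failed rollout
    roles = [
        ["decisive_progress", "regression"],                   # success
        ["useful_exploration", "no_progress_infrastructure"],  # failure
    ]
    for per_segment in role_typed_advantages(outcomes, roles):
        print([round(a, 3) for a in per_segment])
```

In this toy group, the regressive segment inside the successful rollout receives less credit than the decisive one, and the exploratory segment inside the failed rollout is penalized less than the no-progress one. The sign of each segment's advantage still follows the verifier outcome because the role constants are bounded below the normalized outcome magnitude.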

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
