StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

2026-06-05 08:00 · 28 days ago

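AI Summary

Existing reinforcement learning algorithms for large language models use model tokens as the basic unit of optimization, which creates a granularity mismatch in agentic settings. StepPO proposes a new step-centric paradigm: it reformulates the token-level Markov Decision Process (MDP) as a step-level MDP and introduces step-level credit assignment, so that policy optimization aligns with the natural granularity of agent decisions. On multi-hop QA, academic paper search, and text-world action tasks, StepPO consistently outperforms a range of RL algorithms, offering a practical path toward training more capable agents.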

Original text (untranslated)

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
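
The abstract does not spell out StepPO's credit-assignment rule, so the following is only a minimal sketch of what treating interaction steps as the unit of credit could look like. It is not the authors' algorithm. The `Step` container, the discounted return-to-go, the constant `baseline`, and the choice to share one advantage across all tokens of a step are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent-environment interaction step (hypothetical container)."""
    action_tokens: List[int]  # token ids the policy emitted during this step
    reward: float             # reward observed after the step completes


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go per step; discounting runs over steps, not tokens."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def token_weights(
    steps: List[Step], baseline: float = 0.0, gamma: float = 1.0
) -> List[List[float]]:
    """Assumed scheme: every token in a step shares that step's advantage."""
    advantages = [g - baseline for g in step_returns(steps, gamma)]
    return [[adv] * len(s.action_tokens) for s, adv in zip(steps, advantages)]


# A two-step trajectory: a three-token action, then a two-token action.
trajectory = [Step([1, 2, 3], reward=0.0), Step([4, 5], reward=1.0)]
weights = token_weights(trajectory, baseline=0.5, gamma=0.5)
```

In this sketch the weight a token receives depends only on which step it belongs to, not on its position inside the step, which is one simple way to read "step-level" credit. A token-centric scheme would instead discount or estimate advantages per token.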

HuggingFace Daily Papers (community trending papers)


Read the original · arxiv.org
