StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
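Source: arxiv.org. Existing reinforcement learning algorithms for large language models use model tokens as the basic unit of optimization, which creates a granularity mismatch in agentic settings. StepPO proposes a step-centric paradigm that reformulates the token-level Markov Decision Process as a step-level MDP and introduces step-level credit assignment, so that policy optimization aligns with the natural granularity of agent decisions. On multi-hop QA, academic paper search, and text-world action tasks, StepPO consistently outperforms various RL algorithms, offering a practical path for training more capable agents.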
Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm used in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL: it optimizes token-level predictions, while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic units of trajectory representation. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, these components let StepPO optimize agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
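To make the granularity distinction concrete, the following is a minimal sketch of what assigning credit per step rather than per token could look like. It is not StepPO's actual algorithm, which the abstract does not specify. The `Step` class, the per-step rewards, the discount factor `gamma`, and the choice to broadcast a plain discounted return to every action token are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent-environment interaction step (hypothetical structure)."""
    num_action_tokens: int  # tokens the policy generated for this step's action
    reward: float           # reward observed after the action (assumed available)


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go, discounted once per step rather than once per token."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def broadcast_to_tokens(steps: List[Step], step_values: List[float]) -> List[float]:
    """Give every action token in a step the same step-level value."""
    per_token: List[float] = []
    for step, value in zip(steps, step_values):
        per_token.extend([value] * step.num_action_tokens)
    return per_token


# A three-step trajectory with a sparse reward at the final step (illustrative data).
trajectory = [Step(3, 0.0), Step(2, 0.0), Step(4, 1.0)]
returns = step_returns(trajectory, gamma=0.5)
token_weights = broadcast_to_tokens(trajectory, returns)
```

In this sketch the discount applies between steps, so a long action and a short action at the same position in the trajectory receive the same credit. Under a token-level formulation, discounting would instead depend on how many tokens each action contains.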