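This survey maps agentic reinforcement learning for large language models, covering 500+ works grouped along two dimensions: capabilities and applications. It notes that conventional LLM training gives a single reward for a single answer, which cannot handle the multi-step decisions, partial information, and delayed feedback of real tasks. The agentic framework includes memory to track context, planning to choose action sequences, and tools to affect the environment, and it integrates reasoning for handling constraints, perception for multimodal inputs, and self-improvement for refining policies. Reinforcement learning ties these pieces together: rewards arrive at the end of a sequence, and the policy uses them to learn what to do next.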
Nice survey paper mapping agentic reinforcement learning for LLMs, showing how models learn by acting across time.
Covers 500+ works and groups them into a 2-part map of capabilities and applications.
The problem is that common LLM training gives one reward for one answer, and the learning signal ends there.
Real tasks need many steps, partial information, and choices that affect what happens later.
The survey formalizes that setup as an agent that observes only part of the state, chooses an action, and gets feedback.
In that view, the agent uses memory to track context, planning to pick action sequences, and tools to affect the world.
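A rough way to picture the observe, act, feedback loop is the toy sketch below. It is not taken from the survey: the environment `HiddenNumberEnv`, the bisection policy in `run_episode`, and the helper `discounted_returns` are all hypothetical names made up for illustration. The agent never sees the hidden target, only a hint after each guess, and the single reward arrives on the final step.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberEnv:
    """Toy partially observable task (hypothetical): guess a hidden integer."""
    target: int
    max_steps: int = 8
    steps: int = 0

    def step(self, guess: int):
        """Return (observation, reward, done). The target itself is never revealed."""
        self.steps += 1
        if guess == self.target:
            return "correct", 1.0, True  # the only reward comes at the end
        done = self.steps >= self.max_steps
        hint = "higher" if guess < self.target else "lower"
        return hint, 0.0, done


def run_episode(env: HiddenNumberEnv, low: int = 0, high: int = 100):
    """Run one episode with a fixed bisection policy.

    The (low, high) bounds play the role of memory: they summarize
    every hint seen so far.
    """
    trajectory = []
    done = False
    while not done:
        action = (low + high) // 2             # choose an action
        obs, reward, done = env.step(action)   # get partial feedback
        trajectory.append((action, obs, reward))
        if obs == "higher":
            low = action + 1
        elif obs == "lower":
            high = action - 1
    return trajectory


def discounted_returns(rewards, gamma: float = 0.9):
    """Spread a late reward back over earlier steps: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


if __name__ == "__main__":
    traj = run_episode(HiddenNumberEnv(target=37))
    print(traj)
    print(discounted_returns([r for _, _, r in traj], gamma=0.5))
```

The policy here is hard-coded rather than learned. In an actual RL setup, the per-step returns would be the signal used to update the policy, which is how a reward that arrives only at the end can still shape earlier choices.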
It also includes reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies.