# When Does Multi-Agent Reinforcement Learning Improve LLM Workflows: Tradeoffs Across Workflow, Scale, and Policy Sharing

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-22 08:00
- AIHOT score: 49
- AIHOT link: https://aihot.virxact.com/items/cmpw5g4y2007jsl1uqzrpxcav
- Original link: https://arxiv.org/abs/2605.24202

## AI Summary

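This paper studies two strategies for end-to-end reinforcement learning of multi-agent LLM workflows: Shared-Policy, where all roles update a single policy, and Isolated-Policy, where each role has its own parameters. The experimental matrix covers three workflows (Eval-Opt, Voting, and Orch-Workers), math and code tasks, and three model scales (0.6B, 1.7B, and 4B). The study finds that multi-agent RL usually improves on the base model, but the gains depend jointly on workflow, task, and model scale. Isolated-Policy tends to reach higher peak accuracy but is more prone to a performance cliff, while Shared-Policy training redistributes failure into different patterns. Policy sharing does not provide uniform stability; it routes training pressure through different channels, which makes it a design choice with workflow- and task-conditional tradeoffs.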

## Body

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
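The routing argument in the abstract can be made concrete with a toy bookkeeping sketch. This is not the paper's code: the role names, the per-episode call counts in `WORKFLOWS`, and the helpers `route_samples` and `role_share` are all hypothetical, and per-role sample counts are used only as a crude stand-in for gradient mass (actual gradients also depend on token counts, advantages, and the optimizer).

```python
from collections import Counter

# Hypothetical workflow topologies: role -> agent calls per episode.
# These role names and counts are illustrative assumptions, not paper values.
WORKFLOWS = {
    "Eval-Opt": {"optimizer": 1, "evaluator": 1},
    "Voting": {"voter": 3, "aggregator": 1},
    "Orch-Workers": {"orchestrator": 1, "worker": 3},
}


def route_samples(role_calls, sharing):
    """Count how many per-episode samples update each parameter set.

    sharing="shared": every role writes into one parameter set.
    sharing="isolated": each role writes into its own parameter set.
    """
    if sharing not in ("shared", "isolated"):
        raise ValueError(f"unknown sharing mode: {sharing!r}")
    counts = Counter()
    for role, n_calls in role_calls.items():
        key = "shared_policy" if sharing == "shared" else f"policy[{role}]"
        counts[key] += n_calls
    return dict(counts)


def role_share(role_calls):
    """Fraction of a shared policy's samples contributed by each role."""
    total = sum(role_calls.values())
    return {role: n / total for role, n in role_calls.items()}


if __name__ == "__main__":
    for name, role_calls in WORKFLOWS.items():
        print(name)
        print("  isolated:", route_samples(role_calls, "isolated"))
        print("  shared:  ", route_samples(role_calls, "shared"))
        print("  share of shared updates by role:", role_share(role_calls))
```

Under these assumed counts, Isolated-Policy routing gives the Voting workflow's `voter` parameters three samples per prompt, all generated from the same prompt, which loosely mirrors the amplification of per-role gradients described above. Under Shared-Policy routing the same role supplies three of the four samples reaching the single policy, a rough picture of how one role can come to dominate the update.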