# Why Reinforcement Learning for Multi-Step Tool Use Collapses and How Supervisory Signals Fix It

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-24 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmqwqyy1j006sslika3qx280h
- Original link: https://arxiv.org/abs/2606.26027

## AI Summary

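Large language models often suffer catastrophic collapse during reinforcement learning (RL) training for multi-step tool use: performance drops abruptly and tool-invocation structures break down. The study finds that the collapse stems from probability spikes in specific control tokens, while the underlying tool-use capability is not lost and is merely obscured by the format. The researchers systematically explore a range of supervisory signals, including off-policy supervision, hint-based guidance, and erroneous example supervision, and find that interleaving supervised fine-tuning (SFT) with RL significantly improves stability but degrades performance under format and content out-of-distribution (OOD) evaluation. The code is open source.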

## Main Text

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catastrophic collapse, where performance abruptly drops and tool-invocation structures fail. Our analysis reveals that these failures stem from unexpected probability spikes in specific control tokens, which disrupt structured execution; the underlying tool-use capability remains intact and is merely obscured by specific formats. To address this, we systematically investigate a diverse set of supervisory signals, including off-policy supervision, hint-based guidance, erroneous example supervision, and others, applied under both synchronous and interleaved training schemes. We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability, but exhibits degraded performance under format and content out-of-distribution (OOD) evaluation. We also analyze the impact of learning rates and generalization across settings. These results highlight the importance of understanding RL failures and demonstrate how diverse supervisory signals can guide exploratory learning, enabling robust training of LLMs for complex, multi-step tool-use tasks. Our code is available at https://github.com/hypasd-art/Tool-RL-Box.
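
To make the interleaved scheme more concrete, here is a minimal sketch of one way SFT steps could be slotted between RL steps. It is not taken from the paper or its repository: the function names `interleaved_schedule` and `run_training`, the parameter `rl_steps_per_sft`, and the fixed ratio are all hypothetical, and the actual update calls are left as placeholders.

```python
from typing import Iterator, List


def interleaved_schedule(total_steps: int, rl_steps_per_sft: int) -> Iterator[str]:
    """Yield "rl" or "sft" for each training step.

    Hypothetical schedule: after every `rl_steps_per_sft` RL updates,
    one SFT update is inserted. The ratio is an illustrative assumption,
    not a value reported in the source.
    """
    if total_steps < 0 or rl_steps_per_sft < 1:
        raise ValueError("total_steps must be >= 0 and rl_steps_per_sft >= 1")
    period = rl_steps_per_sft + 1
    for step in range(total_steps):
        # The last slot of each period is the SFT step.
        yield "sft" if step % period == period - 1 else "rl"


def run_training(total_steps: int, rl_steps_per_sft: int) -> List[str]:
    """Walk the schedule and return the sequence of phases that would run."""
    log: List[str] = []
    for phase in interleaved_schedule(total_steps, rl_steps_per_sft):
        # Placeholder: a real loop would perform an RL update on sampled
        # rollouts for "rl" and a supervised update on demonstrations for "sft".
        log.append(phase)
    return log


if __name__ == "__main__":
    # Prints ['rl', 'rl', 'rl', 'sft', 'rl', 'rl', 'rl', 'sft']
    print(run_training(total_steps=8, rl_steps_per_sft=3))
```

A synchronous scheme would instead combine both signals within a single update. The sketch covers only the alternating case, since that is the variant the abstract singles out for improved stability.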