ToolMaze: A Benchmark for Dynamic Replanning and Failure Recovery in LLM Agents When Tools Fail

2026-06-04 08:00 · 29 days ago

AI Summary

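ToolMaze is a benchmark that evaluates how well LLM agents discover alternative paths and recover from errors when tools fail. It uses a two-dimensional design: DAG topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that nearly all models degrade under perturbation. Implicit semantic failures cause the Perturbation Recovery Rate (PRR) to drop by about 37%, and complex topologies trap agents in futile trial-and-error loops. Key finding: agentic fault tolerance improves with model scale 3.66 times more slowly than basic task execution, which makes dynamic replanning a distinct bottleneck that model scaling does not resolve. Data and code are publicly available.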

Original abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
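The 2×2 perturbation taxonomy in the abstract can be pictured as the product of two independent axes. The Python below is only an illustrative sketch and is not taken from the ToolMaze repository: the names `Visibility`, `Persistence`, `Perturbation`, and `taxonomy` are hypothetical, and the one-line glosses in the comments are one reading of the abstract's terms, not the paper's definitions.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Visibility(Enum):
    # Assumed reading: the tool reports the failure (e.g. an error message).
    EXPLICIT = "explicit"
    # Assumed reading: the tool returns a plausible-looking but corrupted output.
    IMPLICIT = "implicit"


class Persistence(Enum):
    # Assumed reading: the failure may clear on a later call.
    TRANSIENT = "transient"
    # Assumed reading: the failure does not clear.
    PERMANENT = "permanent"


@dataclass(frozen=True)
class Perturbation:
    """One cell of the 2x2 grid (hypothetical type, for illustration only)."""
    visibility: Visibility
    persistence: Persistence

    @property
    def label(self) -> str:
        return f"{self.visibility.value}/{self.persistence.value}"


def taxonomy() -> list[Perturbation]:
    """Enumerate all four combinations of the two axes."""
    return [Perturbation(v, p) for v, p in product(Visibility, Persistence)]


if __name__ == "__main__":
    for cell in taxonomy():
        print(cell.label)
```

Keeping the two axes as separate enums mirrors the abstract's framing: whether a failure is visible and whether it persists are independent properties, so the four categories fall out of their product instead of being listed by hand.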

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
