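The paper proposes the PlanBench-XL benchmark, with 327 tasks and 1,665 tools, to test whether LLM agents can complete long-horizon tool-use tasks when tools are hard to discover. GPT-5.4 reaches 51.90% accuracy in the regular setting and drops to 11.36% in the hardest blocked setting. The core idea is to have agents reason forward from what is known and backward from what is needed at the same time, rather than relying on an explicit tool path. The paper also adds broken or misleading tools to test whether agents can switch strategies on their own when a path fails.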
This paper shows that LLM agents still struggle to plan through big, messy tool libraries.
The paper builds PlanBench-XL, a retail benchmark that tests whether LLM agents can solve long tool-use tasks when tools are hard to find.
It has 327 tasks and 1,665 tools, and agents must uncover hidden intermediate facts before they can answer.
Even strong models struggle: GPT-5.4 reaches 51.90% accuracy in the normal setting and drops to 11.36% in the hardest blocked setting.
The problem is that real agents often face huge tool libraries, so they cannot see every tool at once and must search for useful ones while solving the task.
The core idea is to make agents plan both forward from what they know and backward from what they need, instead of giving them a clear tool path.
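To make the forward-and-backward idea concrete, here is a minimal sketch over a toy tool library. It is not the paper's method or code: the `Tool` type, the `plan` function, and the retail tool names are all hypothetical, and a real agent would do this reasoning with an LLM over retrieved tools rather than with set operations. The backward pass collects the facts that could lead to the goal; the forward pass then runs only tools whose inputs are already known and whose outputs are on that list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical tool description: facts it needs and facts it produces.
    name: str
    needs: frozenset
    gives: frozenset


def plan(tools, known, goal):
    """Return an ordered list of tool names that derives `goal`, or None."""
    # Backward pass: starting from the goal, collect every fact that some
    # tool chain could use to reach it.
    relevant = {goal}
    frontier = [goal]
    while frontier:
        fact = frontier.pop()
        for t in tools:
            if fact in t.gives:
                for need in t.needs:
                    if need not in relevant:
                        relevant.add(need)
                        frontier.append(need)

    # Forward pass: starting from known facts, run a tool only if its inputs
    # are available and it yields a new fact the backward pass marked useful.
    have = set(known)
    steps = []
    while goal not in have:
        runnable = [
            t for t in tools
            if t.needs <= have and (t.gives & relevant) - have
        ]
        if not runnable:
            return None  # dead end: no known path to the goal
        t = runnable[0]
        have |= t.gives
        steps.append(t.name)
    return steps


# Hypothetical retail tools; the last two are distractors.
TOOLS = [
    Tool("find_order", frozenset({"customer_id"}), frozenset({"order_id"})),
    Tool("order_items", frozenset({"order_id"}), frozenset({"sku"})),
    Tool("sku_price", frozenset({"sku"}), frozenset({"price"})),
    Tool("loyalty_tier", frozenset({"customer_id"}), frozenset({"tier"})),
    Tool("store_hours", frozenset({"store_id"}), frozenset({"hours"})),
]

print(plan(TOOLS, {"customer_id"}, "price"))
```

In this toy setup the distractor `loyalty_tier` is runnable from the start but never chosen, because the backward pass never marks `tier` as useful. Dropping `order_items` from the list makes `plan` return `None`, which is a crude stand-in for a path that fails and forces a change of strategy.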