# ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-04 08:00
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmqazuscw00xcsl35vwsgurk2
- Original link: https://arxiv.org/abs/2606.12451

## AI Summary

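Large language models acting as agents over large tool catalogs face a retrieval bottleneck. Parametric tool retrieval encodes each tool as a virtual token and fine-tunes the model in two stages (memorization, then retrieval). It performs strongly on standard ToolBench, but those results do not reveal whether the model truly understands its tools. ToolSense is an open-source, LLM-powered diagnostic framework that automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB, with three ambiguity tiers), an MCQ probing benchmark, and a QA probing benchmark. Applied to ToolBench's roughly 47k tools and used to evaluate five training configurations, it finds a knowledge-retrieval dissociation: on RRB, some configurations drop by about 50-64 percentage points relative to the fully specified benchmark and fall below the embedding-model baseline, and some models score near random on factual probes. The framework and benchmarks are open source.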

## Full Text

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning in two stages (memorization, then retrieval SFT) so that the LLM itself serves as the retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation. On RRB queries, several configurations collapse by ~50-64 percentage points compared to fully specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance on the standard benchmarks, some models score near-random on factual probes. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
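The two quantities the abstract reports, a drop measured in percentage points and a probe score described as near-random, can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the ToolSense implementation: the function names, the use of top-1 accuracy as the retrieval metric, the four-option MCQ format, the 0.05 margin, and the toy tool names and predictions are all hypothetical, and none of the numbers come from the paper.

```python
from typing import Sequence


def accuracy_at_1(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked tool equals the gold tool."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one query")
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)


def drop_in_points(full_score: float, realistic_score: float) -> float:
    """Absolute drop in percentage points between two scores in [0, 1]."""
    return (full_score - realistic_score) * 100.0


def mcq_chance_level(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on an MCQ probe."""
    if num_options < 2:
        raise ValueError("an MCQ needs at least two options")
    return 1.0 / num_options


def is_near_random(accuracy: float, num_options: int, margin: float = 0.05) -> bool:
    """True if accuracy lies within `margin` of chance (margin is an assumption)."""
    return abs(accuracy - mcq_chance_level(num_options)) <= margin


# Toy data only: hypothetical tool names and predictions, not paper results.
gold = ["weather_api", "flight_search", "currency_convert", "news_feed"]
full_query_preds = ["weather_api", "flight_search", "currency_convert", "news_feed"]
realistic_query_preds = ["weather_api", "news_feed", "weather_api", "flight_search"]

full_acc = accuracy_at_1(full_query_preds, gold)
realistic_acc = accuracy_at_1(realistic_query_preds, gold)
drop = drop_in_points(full_acc, realistic_acc)

print(f"fully specified queries: {full_acc:.2f}")
print(f"realistic queries:       {realistic_acc:.2f}")
print(f"drop: {drop:.1f} percentage points")
print("MCQ probe near random:", is_near_random(0.27, num_options=4))
```

The drop is computed as an absolute difference in percentage points, matching how the abstract states it, rather than as a relative percentage of the original score. A model could in principle show a small drop on retrieval and still be flagged by `is_near_random` on probes, which is the kind of dissociation the abstract describes.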
