A Matter of TASTE: Improving the Coverage and Difficulty of AI Agent Benchmarks
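Existing agent benchmarks such as τ^2-Bench have become too saturated in difficulty to measure the upper limits of agent capability. To address this, the study proposes TASTE. The method reverses the conventional task construction process: an Adaptive Contrastive n-gram model, trained on LLM-judged validity signals, generates valid tool sequences, which are then filtered by clustering and refined through iterative difficulty evolution to automatically build τ^c-Bench, a benchmark with broader tool coverage and higher difficulty. An evaluation of 11 agent/user LLM pairs shows that several models that nearly saturate τ^2-Bench drop sharply on τ^c-Bench, and that the number of unique tool combinations required by the generated tasks rises markedly. This indicates that high scores on existing benchmarks often reflect test-set saturation rather than robust model capability.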
As agent capabilities advance, existing benchmarks, such as τ^2-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ^2-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ^2-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
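The abstract does not spell out how the Adaptive Contrastive n-gram model is formulated, so the following is only a minimal sketch of the general idea, under stated assumptions: a bigram model over tool names whose transition weights are raised by sequences judged valid and lowered by sequences judged invalid. The tool names, the `penalty` and `smoothing` parameters, and the weighting formula are all hypothetical, the LLM judge is replaced by pre-labeled lists, and the adaptive retraining loop, clustering, and difficulty evolution are not modeled.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking sequence boundaries


def bigram_counts(sequences):
    """Count tool-to-tool transitions, including start and end markers."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START, *seq, END]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def contrastive_weights(valid, invalid, penalty=1.0, smoothing=0.1):
    """Assumed weighting: reward transitions seen in valid sequences,
    discount those that also appear in invalid sequences."""
    pos, neg = bigram_counts(valid), bigram_counts(invalid)
    return {
        prev: {
            nxt: (count + smoothing) / (1.0 + penalty * neg[prev][nxt])
            for nxt, count in nexts.items()
        }
        for prev, nexts in pos.items()
    }


def sample_sequence(weights, rng, max_len=8):
    """Sample one tool sequence by walking the weighted bigram table."""
    seq, prev = [], START
    while len(seq) < max_len:
        options = weights.get(prev)
        if not options:
            break
        tools = list(options)
        nxt = rng.choices(tools, weights=[options[t] for t in tools])[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


def unique_tool_combinations(sequences):
    """Coverage measure: distinct sets of tools used across sequences."""
    return {frozenset(seq) for seq in sequences}


# Hypothetical labels standing in for LLM-judged validity signals.
valid = [
    ["get_user", "get_order", "cancel_order"],
    ["get_user", "get_order", "refund_order"],
    ["get_user", "update_address"],
]
invalid = [["get_user", "update_address", "cancel_order"]]

weights = contrastive_weights(valid, invalid)
rng = random.Random(0)
samples = [sample_sequence(weights, rng) for _ in range(20)]
coverage = unique_tool_combinations(samples)
```

In this toy setup, the transition `get_user` to `update_address` is down-weighted because it also occurs in a sequence labeled invalid, while `get_user` to `get_order` keeps its full weight. Counting distinct tool sets, as `unique_tool_combinations` does, is one simple way to read the "unique tool combinations" coverage measure mentioned in the abstract.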