TACO: Tool-Augmented Credit Optimization for Agentic Tool Use
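Source: arxiv.org
TACO is a GRPO variant designed for code-tool agents. It addresses credit assignment for tool calls through two coupled advantage channels. Differential Answer-Probe Reward (DAPR) inserts probe tokens into the reasoning and compares, in a self-supervised way, the model's predictions with and without the tool, assigning each call a positive, negative, or zero value without an external judge. Outcome-Gated Advantage Routing (OGAR) uses the outcome of each call to route the final-answer advantage only to the segments that led to the correct output, suppressing useless calls. After two-stage SFT+RL training, TACO achieves consistent accuracy gains on perception, reasoning, and general multimodal benchmarks, and learns to call tools only when necessary.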
Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations can be useful, redundant, or misleading. Outcome-only rewards cannot precisely distinguish these cases, and existing process rewards either fail to attribute final correctness to individual tool calls, or require an external judge model. To address this, we introduce Tool-Augmented Credit Optimization (TACO), a GRPO variant for code-tool agents built on two coupled advantage channels. The first, Differential Answer-Probe Reward (DAPR), is a self-supervised, judge-free tool-contribution advantage that credits each tool call by its own effect on answering correctly. Probe tokens inserted into the model's reasoning elicit its predictions with and without the tool, and the difference in outcome reward is taken as the call's value: positive for a useful call, negative for a misleading one, and zero for one that changes nothing. This reuses the existing answer checker with no auxiliary judge, and, being a difference rather than an absolute probe score, is naturally robust to probe-hacking. The second is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR): a parameter-free rule that, conditioned on the call's outcome, delivers this credit only to the responsible segments, suppressing wasted tool calls without any cost term. We train TACO through a two-stage SFT+RL pipeline. Extensive experiments across perception, reasoning, and general multimodal benchmarks show that it yields consistent accuracy gains and learns to invoke its tools only when they help.
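The differential idea behind DAPR can be illustrated with a minimal Python sketch. It is not the paper's implementation: the checker `exact_match_reward`, the helpers `dapr_value` and `label_call`, and the toy probe answers are all hypothetical names and data introduced here for illustration, and the sketch assumes a binary outcome reward. It only shows how the difference between the with-tool and without-tool probe rewards yields a positive, negative, or zero value for a call.

```python
from typing import Callable, List, Tuple


def exact_match_reward(prediction: str, reference: str) -> float:
    """Hypothetical answer checker: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def dapr_value(
    pred_without_tool: str,
    pred_with_tool: str,
    reference: str,
    checker: Callable[[str, str], float] = exact_match_reward,
) -> float:
    """Value of one tool call as a difference of outcome rewards.

    Assumption: the probe answer elicited before the tool output is seen
    and the one elicited after it are scored by the same answer checker.
    """
    return checker(pred_with_tool, reference) - checker(pred_without_tool, reference)


def label_call(value: float) -> str:
    """Map the signed difference to the three cases named in the abstract."""
    if value > 0:
        return "useful"
    if value < 0:
        return "misleading"
    return "neutral"


# Toy data (made up): (probe answer without tool, probe answer with tool).
reference = "4"
probes: List[Tuple[str, str]] = [
    ("3", "4"),  # wrong before the call, right after it
    ("4", "3"),  # right before the call, wrong after it
    ("4", "4"),  # right both times: the call changed nothing
    ("2", "3"),  # wrong both times: the call changed nothing
]

values = [dapr_value(before, after, reference) for before, after in probes]
labels = [label_call(v) for v in values]

for (before, after), v, name in zip(probes, values, labels):
    print(f"without={before!r} with={after!r} value={v:+.1f} -> {name}")
```

The third probe pair shows why a difference is harder to game than an absolute probe score: a probe that is already correct without the tool earns the call no credit, because only the change attributable to the call is counted.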