SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

2026-06-05 08:00 · 28 days ago

AI summary

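SlimSearcher proposes a training framework that balances accuracy against computational cost. In the supervised fine-tuning stage it applies Pareto-efficient filtration, keeping only trajectories that are both successful and economical. In the reinforcement learning stage it introduces Adaptive Reward Gating, which dynamically evaluates tool and token efficiency and avoids the brevity bias and reward hacking that absolute penalties cause. On benchmarks including GAIA, BrowseComp, and XBenchDeepSearch, it reduces tool-call rounds by 17%-58% while maintaining or improving accuracy.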
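To make the filtering idea concrete, here is a minimal sketch of one way a Pareto-efficient filter over SFT trajectories could look. It is not the paper's implementation: the `Trajectory` fields, the grouping by question, and the two cost axes (tool calls and tokens) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical record layout; the paper does not specify one.
    question_id: str
    correct: bool
    tool_calls: int
    tokens: int


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """True if `a` is no worse than `b` on both costs and strictly better on one."""
    no_worse = a.tool_calls <= b.tool_calls and a.tokens <= b.tokens
    strictly_better = a.tool_calls < b.tool_calls or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_filter(trajectories: list[Trajectory]) -> list[Trajectory]:
    """Keep correct trajectories that no other correct trajectory
    for the same question dominates (assumed selection rule)."""
    by_question: dict[str, list[Trajectory]] = {}
    for t in trajectories:
        if t.correct:  # failed trajectories are dropped outright
            by_question.setdefault(t.question_id, []).append(t)

    kept = []
    for group in by_question.values():
        for t in group:
            if not any(dominates(other, t) for other in group):
                kept.append(t)
    return kept
```

Under this rule a trajectory that uses fewer tool calls but more tokens than another survives alongside it, since neither dominates the other; only trajectories that are worse on both axes are discarded.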

Original text · untranslated

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks and that lead to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
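The abstract describes the RL reward only at a high level, so the following is a hedged sketch of how a correctness gate could be cascaded with cohort-relative efficiency terms. The function name `gated_rewards`, the dictionary keys, the use of the cohort mean as the reference, and the weights `tool_weight` and `token_weight` are all assumptions for illustration, not the paper's formula.

```python
from statistics import mean


def gated_rewards(cohort, tool_weight=0.2, token_weight=0.1):
    """Score a cohort of rollouts sampled for the same prompt.

    Each rollout is a dict with keys "correct", "tool_calls", "tokens"
    (hypothetical layout). Incorrect rollouts get 0. Correct rollouts
    get 1 plus a bonus for using fewer resources than the average
    correct rollout in the same cohort.
    """
    correct = [r for r in cohort if r["correct"]]
    if not correct:
        # No correct reference: efficiency is never rewarded on its own.
        return [0.0] * len(cohort)

    mean_tools = mean(r["tool_calls"] for r in correct)
    mean_tokens = mean(r["tokens"] for r in correct)

    def saving(value, reference):
        # Fractional saving relative to the cohort, floored at zero so
        # a correct but costly rollout is never pushed below 1.0.
        if reference <= 0:
            return 0.0
        return max(0.0, (reference - value) / reference)

    rewards = []
    for r in cohort:
        if not r["correct"]:
            rewards.append(0.0)  # correctness gate closes first
            continue
        bonus = (tool_weight * saving(r["tool_calls"], mean_tools)
                 + token_weight * saving(r["tokens"], mean_tokens))
        rewards.append(1.0 + bonus)
    return rewards
```

Two properties of this sketch mirror the abstract's claims. Because the reference is computed within the cohort rather than fixed in advance, a hard question that needs many tool calls is not penalized against an absolute budget. Because the bonus is only reachable through the correctness gate, a short wrong answer can never outscore a long correct one.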

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org