Span-Level Error Localization in Deep-Research Agent Trajectories
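Source: arxiv.org. Deep-research AI agents carry out tasks through long trajectories of search and tool calls, but evaluating only the final answer cannot reveal which steps in the trajectory caused an error. This study addresses span-level error localization: it collects 2,790 real trajectories from two frameworks, three models, and three benchmarks and, after LLM-assisted expert annotation, builds TELBench, a 1,000-instance evaluation benchmark. It also proposes DRIFT, a claim-centric auditing framework that tracks agent claims and checks how well trajectory evidence supports them. Experiments show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points.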
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
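To make the two reported quantities concrete, the sketch below shows one plausible way to score an auditor's output against annotated error spans. The abstract does not define the metrics, so everything here is an assumption: spans are identified by their integer position in the trajectory, span-level localization is scored as set precision/recall/F1, and first-error accuracy checks whether the earliest predicted error span matches the earliest annotated one. The function names `span_prf` and `first_error_correct` are hypothetical and do not come from the paper.

```python
from typing import Dict, Optional, Set


def span_prf(predicted: Set[int], gold: Set[int]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over span indices (assumed metric)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def first_error_correct(predicted: Set[int], gold: Set[int]) -> bool:
    """True when the earliest predicted error span equals the earliest gold one.

    Assumption: if both sets are empty, the auditor correctly reported
    that the trajectory contains no harmful error span.
    """
    first_pred: Optional[int] = min(predicted) if predicted else None
    first_gold: Optional[int] = min(gold) if gold else None
    return first_pred == first_gold


if __name__ == "__main__":
    # Hypothetical trajectory: annotators marked spans 2, 3, 5 as harmful;
    # the auditor flagged spans 2, 5, 7.
    gold_spans = {2, 3, 5}
    predicted_spans = {2, 5, 7}
    print(span_prf(predicted_spans, gold_spans))
    print(first_error_correct(predicted_spans, gold_spans))
```

Under these assumptions, a benchmark-level score would be the average of the per-trajectory values; the paper may aggregate differently.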