论文《Scalable Evaluation for AI Agents》提出Human-on-the-Bridge评估方法:将人类判断前置到可复用评估资产中,专家在上游策划评估智慧,而非在测试循环中逐一审查输出。现有方法各有局限:Benchmark测量固定能力,人工审核不具可扩展性,LLM-as-Judge存在评估器设计问题,红队测试偶发,trace审计需明确证据规则。AI智能体需作为行为系统评估,因其多轮推理、调用工具、维护上下文、遵循策略并在不确定性下行动。
>> Scalable Evaluation for AI Agents <<
If you run agent evaluation in production, this one is worth your time.
It shows that front-loading human judgment into reusable evaluation assets is useful.
But why?
Agents reason across turns, call tools, hold context, follow policies, and act under uncertainty, so they have to be judged as behavioral systems.
Current methods each give a fragment. Benchmarks measure fixed capabilities, human review preserves judgment but does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules.
Human-on-the-Bridge puts human expertise upstream, where experts curate reusable evaluation intelligence before testing rather than reviewing each output in the loop.
Paper: https://arxiv.org/abs/2606.16871
Learn to build effective AI agents in our academy: https://academy.dair.ai/