Beyond Static Leaderboards: A Study of Predictive Validity in LLM Agent Evaluation

2026-06-18 08:00 · 3 days ago

AI Summary

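The study argues that aggregate-score leaderboards fail to reflect real performance in deployment and that rankings are unstable in out-of-distribution settings. It draws on fourteen parallel implementation studies of one MCP-based industrial-agent benchmark, covering a multi-modal extension, orchestration, retrieval, reasoning, infrastructure, and evaluation probes, and consolidates them with seven prior agent benchmarks. It proposes ranking by predictive validity (the correlation between in-sample and out-of-sample rank) instead of by mean score, and builds a twelve-tier measurement apparatus that exposes deployment dimensions HELM and its successors overlook. It gives three falsifiable out-of-distribution criteria with explicit thresholds, and closes with a pre-registered pilot design and a forward-looking vision for what next-generation benchmarks should report.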

Original abstract · untranslated

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
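To make the proposed metric concrete, here is a minimal sketch of predictive validity as a rank correlation between in-sample and out-of-sample scores. The abstract does not say which correlation coefficient the authors use; Spearman's rho is an assumption here, and the function names (`ranks`, `pearson`, `predictive_validity`) and the configuration scores are hypothetical illustrations, not taken from the paper.

```python
import math
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank configurations by score (1 = best); ties share the average rank."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of configurations tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / math.sqrt(var_x * var_y)


def predictive_validity(in_sample: Dict[str, float],
                        out_of_sample: Dict[str, float]) -> float:
    """Spearman's rho (assumed choice) between in-sample and out-of-sample ranks."""
    if set(in_sample) != set(out_of_sample):
        raise ValueError("both score tables must cover the same configurations")
    names = sorted(in_sample)
    rank_in = ranks(in_sample)
    rank_out = ranks(out_of_sample)
    return pearson([rank_in[n] for n in names], [rank_out[n] for n in names])


if __name__ == "__main__":
    # Hypothetical scores for four agent configurations.
    public_scores = {"A": 0.82, "B": 0.79, "C": 0.75, "D": 0.70}
    hidden_scores = {"A": 0.55, "B": 0.61, "C": 0.58, "D": 0.40}
    print(round(predictive_validity(public_scores, hidden_scores), 3))
```

In this made-up example, configuration A tops the in-sample table but drops to third out of sample, so the rank correlation is only 0.4. That is the kind of rank instability the abstract says an aggregate-score leaderboard hides.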

Agents · MCP/Tools · Papers/Research

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org