# Agents' Last Exam

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-03 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmq6vsrhn0008slbht3mznghb
- Original link: https://arxiv.org/abs/2606.05405

## AI Summary

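AI systems perform strongly on many benchmarks, but that performance has not translated into economically meaningful industry deployment. The new benchmark Agents' Last Exam (ALE) was developed jointly with 250+ industry experts and is based on the O*NET/SOC 2018 federal occupational taxonomy. It covers 13 industry clusters, 55 subfields, and 1,000+ tasks, and evaluates AI agents on long-horizon, high-economic-value, real-world workflows. On the hardest tier, the average full pass rate is currently only 2.6%. ALE is designed as a living benchmark whose task pool keeps expanding, with the aim of closing the gap between benchmark success and GDP impact.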

## Main Text

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
