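A new paper proposes the "Agents' Last Exam" benchmark, which tests the ability of AI agents to complete real expert work. Tasks come from actual projects across 55 digital work domains, including engineering, finance, medicine, law, media, and science, and require agents to produce deliverables using ordinary tools such as files, browsers, the command line, and desktop software. Evaluation uses automated checks or strict rubrics. The results show that the strongest current agents average a full-pass rate of only 2.6% on the hardest task tier, far below what their benchmark scores suggest. The paper notes that benchmark success has not yet translated into broad workplace capability.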
Today's frontier agents are far less ready for real-world automation than their benchmark scores suggest.
This paper proposes Agents' Last Exam, a benchmark that asks AI agents to finish real expert work, and today's agents mostly fail it.
Even today's strongest agents are nowhere near reliable on the hardest real workflows, which means benchmark success has not yet become broad workplace capability.
So this paper shifts the question from "can AI answer hard questions?" to "can AI complete real work that people get paid to do?"
Most of today's AI benchmarks show impressive scores, but they do not prove that agents can finish useful work in real jobs.