AA-Briefcase benchmark: the best AI model fully solves just 3% of real knowledge work tasks
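Source: the-decoder.com. The AA-Briefcase benchmark from Artificial Analysis places AI models in multi-week knowledge work projects built from thousands of fragmented source files such as Slack messages, emails, and meeting notes. The best performer, Claude Fable 5, has the highest pass rate but fully meets the criteria on only 3% of tasks; on 31 of 91 tasks, no model reaches a 50% pass rate. Weaker models fail because they miss relevant files or produce invalid output, while stronger models miss details because they cannot piece together information across sources. Per-task costs differ by more than 800x, from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.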
New benchmark exposes how badly AI struggles with real knowledge work
Even the best AI model fails at realistic knowledge work, fully solving just 3 percent of tasks.
The new AA-Briefcase benchmark from Artificial Analysis puts AI models through multi-week knowledge work projects built from thousands of fragmented source files like Slack threads, emails, meeting transcripts, and large data exports. The top performer, Claude Fable 5, hits the highest rubric pass rate but still nails all criteria on just 3 percent of tasks. On 31 out of 91 tasks, no model even clears 50 percent.
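
The gap between a high rubric pass rate and a low share of fully solved tasks can be sketched with a small calculation. This is only an illustration: the article does not say how Artificial Analysis aggregates its scores, so the per-task averaging, the function names, and the toy data below are all assumptions and not benchmark results.

```python
def rubric_pass_rate(results):
    """Mean share of rubric criteria passed, averaged over tasks (assumed aggregation)."""
    return sum(sum(task) / len(task) for task in results) / len(results)


def fully_solved_rate(results):
    """Share of tasks where every rubric criterion passed."""
    return sum(all(task) for task in results) / len(results)


# Hypothetical data: four tasks with four criteria each (True = criterion passed).
# These values are invented for illustration only.
results = [
    [True, True, True, True],    # every criterion met
    [True, True, True, False],   # one missed detail
    [True, True, False, True],
    [True, False, True, True],
]

print(rubric_pass_rate(results))
print(fully_solved_rate(results))
```

In this toy case most individual criteria pass, yet only one task in four counts as fully solved, because a single missed criterion is enough to disqualify a task.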

The types of errors shift as models get better. Weaker models choke on basic execution as they miss relevant files or spit out unusable results. Stronger models fail more quietly, as they hit the obvious requirements but miss details you'd only catch by piecing together information from multiple sources.
There is also a significant price gap: per-task costs span more than 800x, from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.