Ethan Mollick 称赞 AA-Briefcase 为真实知识工作优质基准

Ethan Mollick@emollick

2026-06-19 07:51·2天前

AI 摘要

Ethan Mollick 称赞 AA-Briefcase 是真实知识工作的优质基准，未饱和且含私有保留测试，同时询问是否有与人类的对比。该基准由 @ArtificialAnlys 发布，测试模型在多周、多任务项目中的能力，输入含数万条 Slack 消息和数千封邮件。模型排名：Claude Fable 5（已不可用）以 1587 Elo 居首，Claude Opus 4.8（1356）第二，GLM-5.2 max（1266）第三。结果凸显难度：最佳模型仅 3% 任务满足全部标准，31/91 任务无模型超过 50%，成本跨度约 800 倍。

I have given AA a hard time about its previous agentic evaluation but this looks like a good and impressive benchmark for real world knowledge work that is unsaturated and had private hold out tests.

This is one to watch - I didn't see a human comparison score though？

Artificial AnalysisAnnouncing AA-Briefcase, the benchmark for the next era of agentic knowledge work AA-Briefcase is our new benchmark for testing models on long-horizon knowledge...

智能体Anthropic推理评测/基准

在 X 查看原推

Ethan Mollick@emollick · X