Ethan Mollick (@emollick)

2026-05-03 18:56 · 60 days ago

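AI summary

Benchmarking frontier agents on longer tasks is becoming increasingly difficult. Repeated measurement is very expensive, and a model run inside a harness behaves differently from the same model called via an API. I suspect benchmarks understate progress: they are designed for models, not for harnessed agents.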

It's getting hard to benchmark frontier agent performance on longer tasks. Repeated measurement is very expensive and there are differences between using models in harnesses versus via APIs.

I suspect benchmarks understate progress; they are built for models, not harnessed agents.

Tags: Agents · Expert opinions · Phenomena/Trends · Evaluation/Benchmarks

