It's getting hard to benchmark frontier agent performance on longer tasks. Repeated measurement is very expensive, and using a model in a harness differs from using it via an API.
I suspect benchmarks understate progress: they are built for models, not harnessed agents.