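New research proposes a metric called Effective Feedback Compute (EFC) for optimizing the design of AI agent harnesses. In conventional evaluation, raw token counts and tool-call counts predict agent failure with an R² of only 0.33 to 0.42, while EFC raises that to 0.99. Reallocating resources based on EFC lifts agent success rate from 0.27 to 0.90 at the same compute, turning harness design from empirical guesswork into a predictable process.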
// Scaling Laws for Agent Harnesses //
If you build agent harnesses, this one is worth your time.
(bookmark it)
Most harness tuning treats every token and tool call as if it counts equally, so only total volume matters. New research shows that most of that volume does not count.
The work introduces Effective Feedback Compute (EFC), a coordinate that counts only the feedback an agent can actually act on.
Raw token and tool-call counts explain agent failure with an R² of 0.33 to 0.42. EFC pushes that to 0.99.
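If R² is unfamiliar: it is the share of variance in an outcome that a predictor accounts for under a fitted model. Here is a minimal sketch of that comparison for a one-variable linear fit. Everything in it is an assumption for illustration: the numbers are made-up toy data, not results from the research, and `useful_feedback` is a hypothetical stand-in, not the EFC definition (which this post does not spell out).

```python
def r_squared(x, y):
    """R^2 of a one-variable least-squares line fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


# Toy data, invented for illustration only (six hypothetical harness runs).
failure_rate = [0.90, 0.80, 0.65, 0.50, 0.35, 0.20]
raw_tokens_k = [50, 10, 60, 20, 30, 40]   # raw volume, weakly related to failure
useful_feedback = [1, 2, 3, 4, 5, 6]      # hypothetical "actionable feedback" count

r2_raw = r_squared(raw_tokens_k, failure_rate)
r2_feedback = r_squared(useful_feedback, failure_rate)

print(f"R^2 using raw tokens:      {r2_raw:.2f}")
print(f"R^2 using useful feedback: {r2_feedback:.2f}")
```

In this toy setup the second predictor tracks failure far more tightly than raw volume does, which is the shape of the claim above. The research may well fit a different functional form than a straight line, so treat this as a reading aid rather than a reproduction.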
Why does it matter?
Once you budget by useful feedback instead of raw volume, reallocation alone lifts success from 0.27 to 0.90 at the same compute. This also turns harness design from guesswork into something you can predict.