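Lee Robinson criticizes the limitations of current AI model benchmarks: SWE-bench, for example, is outdated, and reported results are hard to reproduce. Scores are sensitive to hardware, GPU differences, and small prompt changes, so they fluctuate noticeably. These benchmarks are valuable to model trainers for measuring progress, but for ordinary users they stop being a useful reference once scores saturate. He points out that important factors such as a model's interaction style and personality cannot be adequately measured by existing public benchmarks. He therefore recommends that users look at several benchmarks together and try the models themselves to form their own judgment.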
Quick rant on AI model benchmarks:
- Some of the most popular ones are no longer helpful (SWE-bench1)
- It can be very hard to reproduce reported results (so lots of variance)
- Take them with a grain of salt, look at the average across many
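To make "look at the average across many" concrete, here is a minimal Python sketch. The benchmark names and scores are made up for illustration (they are not real results), and it assumes every benchmark reports a score on the same 0-100 scale, so a plain average is only a rough summary.

```python
from statistics import mean, stdev

# Hypothetical scores (percent) from three repeated runs of each benchmark.
# These numbers are invented for illustration, not measured results.
runs = {
    "bench_a": [71.0, 68.5, 73.5],
    "bench_b": [55.0, 57.0, 56.0],
    "bench_c": [88.0, 90.0, 86.0],
}

def summarize(runs):
    """Return per-benchmark (mean, sample stdev) and the average of the means."""
    per_bench = {name: (mean(s), stdev(s)) for name, s in runs.items()}
    # Unweighted average across benchmarks; assumes comparable 0-100 scales.
    overall = mean(m for m, _ in per_bench.values())
    return per_bench, overall

per_bench, overall = summarize(runs)
for name, (m, sd) in per_bench.items():
    # The stdev column shows how much a single reported number could move.
    print(f"{name}: mean={m:.1f} stdev={sd:.1f}")
print(f"average across benchmarks: {overall:.1f}")
```

If the spread between repeated runs of one benchmark is as large as the gap between two models, that benchmark alone probably should not decide anything.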
We need some creative new ideas for AI model marketing. Supportive of a Survivor spin-off (who is the AI Jeff Probst!?).
I get why every model release shows benchmark scores as the headline. It's actually pretty hard to describe how a model has improved without it sounding like fluff. It also gets boring to say the same thing over and over ("it's better at following instructions", repeat x10).
Benchmarks make it very clear there is a number, which likely started bad, and is now going up. Yay! The reality is that benchmarks are most useful to those *training* the model so they know where to improve.