# AI Model Benchmarks Called Into Question

- Source: Lee Robinson (@leerob)
- Published: 2026-06-03 01:43
- AIHOT score: 58
- AIHOT link: https://aihot.virxact.com/items/cmpx174re001rslgy48kz2de2
- Original link: https://x.com/leerob/status/2061866385752326358

## AI Summary

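Lee Robinson criticizes the limitations of current AI model benchmarks: SWE-bench, for example, is outdated, and reported results are hard to reproduce. Scores are sensitive to hardware, GPU differences, and small prompt changes, so they fluctuate noticeably. These benchmarks are valuable to model trainers for measuring progress, but for ordinary users they lose their reference value once scores saturate. He notes that important factors such as a model's interaction style and personality cannot be adequately measured by existing public benchmarks. He therefore recommends that users consult multiple benchmarks and use the model themselves to form their own judgment.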

## Full Text

Quick rant on AI model benchmarks:

- Some of the most popular ones are no longer helpful (SWE-bench [1])
- It can be very hard to reproduce reported results (so lots of variance)
- Take them with a grain of salt, look at the average across many

We need some creative new ideas for AI model marketing. Supportive of a Survivor spin-off (who is the AI Jeff Probst!?).

I get why every model release shows benchmark scores as the headline. It's actually pretty hard to describe how a model has improved without it sounding like fluff. And also it sounds boring to say the same thing over and over ("it's better at following instructions" repeat x10).

Benchmarks make it very clear there is a number, which likely started bad, and is now going up. Yay! The reality is that benchmarks are most useful to those *training* the model so they know where to improve.

Model labs use these benchmarks to measure progress, which is why having non-saturated benchmarks is extremely helpful. If you see models getting 90% on an eval, it's probably time to make a harder version.

I do think there's a word of caution for everyone interpreting benchmarks. It's very hard to get exactly the same scores, which is why some benches show error bars and report the average over multiple runs.
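To make the point about error bars concrete, here is a minimal Python sketch; it is not from the original post. It assumes you already have per-run scores for a benchmark. The function names `summarize_runs` and `intervals_overlap` and the sample scores are hypothetical, and the 1.96 multiplier is a normal approximation that is only rough for a handful of runs.

```python
import math
import statistics


def summarize_runs(scores, z=1.96):
    """Return (mean, half_width) for a list of per-run benchmark scores.

    half_width is z times the standard error of the mean, i.e. the
    size of the error bar on either side of the mean.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * stderr


def intervals_overlap(scores_a, scores_b):
    """True when the two error bars overlap, so the gap may just be noise."""
    mean_a, half_a = summarize_runs(scores_a)
    mean_b, half_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) <= half_a + half_b


if __name__ == "__main__":
    # Made-up scores for illustration only, not real benchmark results.
    model_a = [71.2, 73.0, 70.1, 72.4, 71.8]
    model_b = [72.9, 71.5, 73.6, 72.0, 73.3]
    for name, runs in (("model_a", model_a), ("model_b", model_b)):
        mean, half = summarize_runs(runs)
        print(f"{name}: {mean:.1f} +/- {half:.1f}")
    print("overlapping error bars:", intervals_overlap(model_a, model_b))
```

When the error bars of two models overlap, a headline gap of a point or two is weak evidence that one is actually better.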

But even further, the hardware and GPUs the evals are running on really matter! Small differences there, or minor tweaks to the prompt, can swing scores by multiple percentage points [2].

All of that to say, it's important to look at many different benchmarks, and then actually use the model to form your own opinion. For example, there's recently been a lot of debate on here about Opus 4.8 not benchmarking as well as other models. But personally I've found the model really good from my own usage. Your mileage may vary!

There aren't many high-quality public benchmarks that measure things like the UX of the model responses, the style of the messages, the warmth or directness of the "personality". These things matter *a lot* for the day-to-day usage. How the model performs in the real world is often different from very specific benches.

In summary, benchmarks matter but they are not a substitute for extensively testing the model yourself with real work.

[1] https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
[2] https://www.anthropic.com/engineering/infrastructure-noise

### Quoted Tweet

> lilly sharples: I'm tired of useless AI benchmarks. How about we give three people a different model, strand them on an island, and see who survives the longest
