# Surprised that, more than a year later, comparing reasoning models on evals by a single number is still the norm

- Source: Noam Brown (@polynoamial)
- Published: 2026-04-09 13:59
- AIHOT link: https://aihot.virxact.com/items/cmnw1yur701caslc3322t11yv
- Original link: https://x.com/polynoamial/status/2042120295449010612

## AI Summary

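The author complains that the industry still habitually evaluates reasoning models by a single number. The quoted post notes that benchmarks such as MMLU/GSM8K kept being reported long after they were obsolete, argues that Intelligence/$ (intelligence per dollar) is the better metric, and points to the multi-dimensional comparison charts from the o1-mini release as an example.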

## Post

I'm surprised that, more than a year later, it's still the norm to compare reasoning models on evals by a single number.

### Quoted tweet

> Noam Brown: LLM evals are slow to adapt. MMLU/GSM8K continued to be reported long after they were obsolete. I think the next thing to go away will be comparing models on ev...