Noam Brown (@polynoamial)

2026-04-09 13:59 · 84 days ago

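AI summary

The author complains that the industry still habitually evaluates reasoning models with a single number. He quotes an earlier post noting that benchmarks such as MMLU/GSM8K kept being reported long after they were obsolete, argues that Intelligence/$ (intelligence per dollar) is the better metric, and points to the multi-dimensional comparison chart from the o1-mini release as an example.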

I'm surprised that, more than a year later, it's still the norm to compare reasoning models on evals by a single number.

Noam Brown: LLM evals are slow to adapt. MMLU/GSM8K continued to be reported long after they were obsolete. I think the next thing to go away will be comparing models on ev...

Tags: Meta, expert opinion, reasoning
