AI Summary
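The author complains that the industry still habitually evaluates reasoning models with a single number, quotes the view that benchmarks such as MMLU/GSM8K kept being reported long after they were obsolete, argues that Intelligence/$ (intelligence per dollar) is the better metric, and points to the multi-dimensional comparison charts from the o1-mini release as an example.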
I'm surprised that, more than a year later, it's still the norm to compare reasoning models on evals by a single number.
LLM evals are slow to adapt. MMLU/GSM8K continued to be reported long after they were obsolete. I think the next thing to go away will be comparing models on ev...