The recipe for "classic" reasoning benchmarks is simple: text-only, several-hour time horizons, easy to grade, with expert human baselines.
What next? In this week's Gradient Update, @GregHBurnham argues it's as easy as dropping one of these four ingredients.