OpenAI @OpenAI · X

2026-06-17 01:23 · 16 days ago

Let's talk about evals.

We're always looking for better ways to measure and forecast model progress, especially as benchmarks get saturated or gamed.

@tejalpatwardhan, who leads our frontier evals team, spoke to @andrewmayne about why evals matter and what models need to be judged on next.
