Let's talk about evals.
We're always looking for better ways to measure and forecast model progress, especially as benchmarks get saturated or gamed.
@tejalpatwardhan, who leads our frontier evals team, spoke to @andrewmayne about why evals matter and what models need to be judged on next.