Noam Brown (@polynoamial)

2026-05-13 01:42 · 51 days ago

I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%.

It's time to retire evals like GPQA and bring in a new set.

Kilian Lieret: The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 ...
