I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%.
It's time to retire evals like GPQA and bring in a new set.
The first ProgramBench task was just solved by GPT-5.5 high/xhigh. Interestingly, the high and xhigh settings picked different languages for the task (C vs. Python). GPT-5.5 ...