# Low GPT-5.5 Scores on a New Eval Prompt Reflection and a Call to Update the Evaluation Suite

- Source: Noam Brown (@polynoamial)
- Published: 2026-05-13 01:42
- AIHOT score: 58
- AIHOT link: https://aihot.virxact.com/items/cmp2y32b601mdsl1qqfiaasvm
- Original link: https://x.com/polynoamial/status/2054255862441812099

## AI Summary

I'm glad to see a new eval with such low scores. When we released GPT-5.5, almost every benchmark had a score above 50%.

It's time to retire evals like GPQA and introduce a new set of evaluations.

## Full Text

I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%.

It's time to retire evals like GPQA and bring in a new set.

### Quoted Tweet

> Kilian Lieret: The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 ...
