Tibo (@thsottiaux)

2026-05-28 14:45 · 35 days ago

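AI summary

The newly released independent benchmark DeepSWE produces results that track developers' day-to-day experience more closely. On coding tasks, GPT-5.5 scored 70% while Claude Sonnet scored 32%, a substantial gap. DeepSWE focuses on a core capability of AI agents in real workflows: whether, given only a short prompt, the agent can locate the right place in a codebase and make a clean change without the user listing specific files. The original post says this confirms what many developers have long observed, and it criticizes SWE-Bench for often failing to reflect real capability because of dataset contamination and weak verification.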

Excited to see more independent benchmarks like that which are not contaminated (trained on by major models).

Kol Tregaskes: Many developers have suspected for months that GPT-5.5 outperforms Claude Sonnet for coding. But SWE-Bench reported near-parity, and it made people question wha...

Anthropic · OpenAI · Reasoning · Coding
