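AI Summary
Results from DeepSWE, a newly released independent benchmark, align more closely with developers' day-to-day experience. On coding tasks, GPT-5.5 scores 70% while Claude Sonnet scores 32%, a substantial gap. DeepSWE focuses on a core capability of AI agents in real workflows: whether, given only a short prompt, an agent can locate the relevant part of the codebase and make a clean change without the user listing specific files. The original post says this confirms what many developers have long observed, and it criticizes SWE-Bench for often failing to reflect real capability because of dataset contamination and weak verification.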
Excited to see more independent benchmarks like this one that are not contaminated (that is, not trained on by major models).
Many developers have suspected for months that GPT-5.5 outperforms Claude Sonnet for coding. But SWE-Bench reported near-parity, and it made people question wha...