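Datacurve has released DeepSWE, a new coding benchmark intended to expose the real capability gaps between models on long-horizon software engineering tasks. On this benchmark, GPT-5.5 scores 70%, GPT-5.4 scores 56%, and Claude Opus 4.7 scores 54%, highlighting significant differences between models. Unlike older benchmarks, DeepSWE uses original tasks that require the agent to search the codebase on its own, understand the design, and modify multiple files. Its solutions require 5.5 times as much code as those in SWE-bench Pro and about twice as many output tokens, reflecting the real challenges of developers' day-to-day work.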
Datacurve launches DeepSWE, a tougher coding benchmark built to show where leading models actually diverge.
GPT-5.5 hits 70%, while GPT-5.4 reaches 56% and Claude Opus 4.7 reaches 54%, revealing a gap that older benchmarks largely hid.
It is a long-horizon software engineering benchmark.
- DeepSWE differs from older coding benchmarks in where its tasks come from: older tests often reuse public GitHub issues and PRs, while DeepSWE uses original tasks, so models are less likely to have seen the answer during training.
- The tasks are also larger even though the prompts are shorter: older tests often tell the model which area to touch, while DeepSWE makes the agent search the repo, understand the design, edit multiple files, and avoid breaking existing behavior.