Rohan Paul (@rohanpaul_ai)

2026-05-28 04:44

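AI summary

Datacurve has released DeepSWE, a new coding benchmark intended to reveal the real capability gaps between models on long-horizon software engineering tasks. On this benchmark GPT-5.5 scores 70%, versus 56% for GPT-5.4 and 54% for Claude Opus 4.7, highlighting significant differences between models. Unlike older benchmarks, DeepSWE uses original tasks that require the agent to search the codebase on its own, understand the design, and modify multiple files. Its solutions require 5.5x the code of SWE-bench Pro and about 2x the output tokens, reflecting the practical challenges of developers' day-to-day work.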

Datacurve launches DeepSWE, a tougher coding benchmark made to show where leading models truly separate.

GPT-5.5 hits 70%, while GPT-5.4 reaches 56% and Claude Opus 4.7 reaches 54%, making a gap that older benchmarks largely hid.

It's a long-horizon software engineering benchmark.

DeepSWE differs from older coding benchmarks in the source of the exam: older tests often reuse public GitHub issues and PRs, while DeepSWE uses original tasks, so models are less likely to have seen the answer during training.

The work is also bigger even when the prompt is shorter, because older tests often tell the model what area to touch, while DeepSWE makes the agent search the repo, understand the design, edit multiple files, and avoid breaking old behavior.

On DeepSWE, prompts are half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
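To make the "shorter prompt, bigger job" point concrete, here's a quick back-of-the-envelope sketch in Python. It only combines the three ratios quoted above; it assumes "5.5x more code" means a 5.5x multiple and treats the ~2x token figure as exactly 2. The variable names are mine, not Datacurve's.

```python
# Ratios relative to SWE-bench Pro, as quoted in the post.
prompt_length_ratio = 0.5   # prompts are "half the length"
solution_code_ratio = 5.5   # read here as a 5.5x multiple (assumption)
output_token_ratio = 2.0    # "~2x", so this is approximate

# How much more solution work each unit of prompt has to carry.
code_per_prompt_unit = solution_code_ratio / prompt_length_ratio
tokens_per_prompt_unit = output_token_ratio / prompt_length_ratio

print(f"solution code per unit of prompt: {code_per_prompt_unit:.1f}x")
print(f"output tokens per unit of prompt: {tokens_per_prompt_unit:.1f}x")
```

Under those assumptions, each unit of prompt text corresponds to roughly 11x as much solution code and about 4x as many output tokens as on SWE-bench Pro.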

The grading is different too, because many older benchmarks reuse tests from one merged PR, while DeepSWE checks whether the requested behavior actually works, even if the model solves it in a different valid way.
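A toy Python sketch of that grading difference, not DeepSWE's actual harness. The task, both solutions, and both checkers are hypothetical, and the "implementation-tied" check is a deliberate caricature of a test coupled to how one reference patch happened to be written.

```python
from typing import Callable, List

Solution = Callable[[List[int]], List[int]]

# Hypothetical task: return unique items, preserving first-seen order.
def dedupe_with_set(items: List[int]) -> List[int]:
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def dedupe_with_dict(items: List[int]) -> List[int]:
    # A different but equally valid approach: dicts keep insertion order.
    return list(dict.fromkeys(items))

def behavioral_check(solution: Solution) -> bool:
    # Grades only on observable behavior.
    return solution([3, 1, 3, 2, 1]) == [3, 1, 2] and solution([]) == []

def implementation_tied_check(solution: Solution) -> bool:
    # Caricature of a test tied to one reference patch: it requires the
    # same internal local variable the reference solution used.
    return "seen" in solution.__code__.co_varnames

for fn in (dedupe_with_set, dedupe_with_dict):
    print(fn.__name__, behavioral_check(fn), implementation_tied_check(fn))
```

Both solutions pass the behavioral check, but only the one that mirrors the reference internals passes the implementation-tied check, which is the failure mode the post attributes to reusing tests from a single merged PR.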

Serena Ge (Datacurve): Today we're releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepS...

