# CursorBench 3.1

- Source: Hacker News trending (Chinese translation by buzzing.cc)
- Author: handfuloflight
- Published: 2026-07-03 00:08
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmr3pfv3r00u0sl7lla37j4fi
- Original link: https://cursor.com/evals

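## AI Summary

CursorBench 3.1 adds codebase understanding, bug-finding, planning, and code review tasks, and improves the grading criteria for edit tasks. On the leaderboard, Fable 5 Max ranks first with a score of 72.9% at $18.02 per task, followed by Fable 5 Extra High (72.0%, $13.74) and Fable 5 High (70.6%, $10.81). Opus 4.7 Max scores 64.8% at $11.02; GPT-5.5 Extra High scores 64.3% at $4.37; Composer 2.5 scores 63.2% at only $0.55. The leaderboard covers 36 models/configurations, with scores ranging from 31.9% to 72.9%.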

## Body

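| # | Model | Score | Avg cost / task | | |
|---|---|---|---|---|---|
| 1 | Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3% | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8% | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Sonnet 5 Max | 61.2% | $6.87 | 93,485 | 93 |
| 14 | Opus 4.7 High | 59.4% | $5.01 | 32,227 | 59 |
| 15 | GPT-5.5 Medium | 59.2% | $2.22 | 9,065 | 35 |
| 16 | Opus 4.8 High | 58.4% | $4.41 | 36,788 | 45 |
| 17 | Sonnet 5 Extra High | 58.4% | $5.23 | 58,228 | 86 |
| 18 | Sonnet 5 High | 57.0% | $3.74 | 41,735 | 66 |
| 19 | Opus 4.8 Medium | 56.6% | $3.83 | 31,684 | 41 |
| 20 | Sonnet 5 Medium | 54.9% | $2.57 | 27,469 | 53 |
| 21 | GLM 5.2 Max | 54.6% | $3.11 | 51,312 | 83 |
| 22 | Opus 4.8 Low | 54.3% | $2.93 | 22,726 | 36 |
| 23 | Opus 4.7 Medium | 52.7% | $2.93 | 19,193 | 41 |
| 24 | Kimi K2.7 Code | 52.7% | $1.92 | 32,902 | 70 |
| 25 | Composer 2 | 52.2% | $0.56 | 14,163 | 40 |
| 26 | GLM 5.2 High | 50.7% | $2.46 | 30,621 | 76 |
| 27 | Gemini 3.5 Flash | 49.8% | $1.94 | 35,105 | 79 |
| 28 | Sonnet 4.6 Max | 49.0% | $3.09 | 40,280 | 55 |
| 29 | GPT-5.5 Low | 48.8% | $1.19 | 4,923 | 24 |
| 30 | Sonnet 4.6 High | 48.8% | $3.06 | 37,352 | 57 |
| 31 | Opus 4.7 Low | 48.3% | $1.87 | 13,164 | 29 |
| 32 | Sonnet 5 Low | 47.7% | $1.46 | 17,028 | 37 |
| 33 | Kimi 2.6 | 47.6% | $1.27 | 24,783 | 56 |
| 34 | Sonnet 4.6 Medium | 46.0% | $2.64 | 31,360 | 50 |
| 35 | Sonnet 4.6 Low | 41.5% | $1.89 | 21,211 | 50 |
| 36 | Kimi 2.5 | 31.9% | $0.87 | 9,446 | 30 |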

### Changelog

#### CursorBench 3.1

- Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
- Improved grading criteria for some edit tasks.

#### CursorBench 3.0

- Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
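The cost calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cursor's actual implementation: the names `Pricing`, `TokenUsage`, `task_cost`, and `avg_cost_per_task` are hypothetical, and the prices and token counts in the example are made up rather than taken from any model's published pricing.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens, split by token type (hypothetical structure)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TokenUsage:
    """Tokens a model consumed on a single task, split by token type."""
    input: int
    cache_read: int
    cache_write: int
    output: int


TOKENS_PER_MILLION = 1_000_000


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to one task's token counts."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / TOKENS_PER_MILLION


def avg_cost_per_task(usages: Sequence[TokenUsage], pricing: Pricing) -> float:
    """Average the per-task cost across all tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


if __name__ == "__main__":
    # Made-up prices and token counts, for illustration only.
    pricing = Pricing(input=2.0, cache_read=0.5, cache_write=2.5, output=10.0)
    usages = [
        TokenUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000),
        TokenUsage(input=50_000, cache_read=200_000, cache_write=0, output=10_000),
    ]
    print(f"${avg_cost_per_task(usages, pricing):.4f}")
```

Because the cost is averaged per task rather than computed from pooled token totals, a handful of very long tasks can raise a model's average noticeably, which is one reason small cost differences between rows deserve the same caution as small score differences.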
