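Cognition has released FrontierCode, a coding evaluation in which each task was written by top open-source maintainers over 40+ hours. METR found that more than half of SWEBench results are unmergeable slop. FrontierCode contains 3000+ rubrics and is the first to measure whether code is mergeable. On the hardest tier, FC Diamond, Opus 4.8 scores only 13.8%. On the easiest tasks in FC Extended, Opus improved from 41% to 74% within 4 months in late 2025, marking AI coding's entry into the "maintainable code" era.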
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI gold medalists and top code maintainers Look At The Data. FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking that plagues other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding: Three eras of benchmarks
2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode