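Cognition has released FrontierCode, a coding benchmark that evaluates whether AI-generated code reaches a quality a maintainer would merge, rather than merely passing tests. The benchmark contains 150 tasks (Main is the hardest 100, Diamond the hardest 50), designed by more than 20 open-source maintainers, with more than 40 hours spent on each task. Scoring uses blocking items (such as broken behavior or missing logic) and weighted items (readability, type safety, and so on). It additionally includes reverse tests, scope checks, and adaptive scoring. On the Diamond subset, Claude Opus 4.8 scores 13.4%, GPT-5.5 6.3%, Gemini 3.1 Pro 4.7%, and the best open-source model, Kimi K2.6, 3.8%, showing that top models still perform poorly at producing mergeable code.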
Incredible! This is just the benchmark we needed.
Claude Opus 4.8 achieves a score of only 13.4% on the Diamond subset. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.
Cognition is introducing FrontierCode, a coding benchmark built to test whether AI code is good enough for a real maintainer to merge, not just whether it passes tests.
FrontierCode asks a harder question: did the model produce a clean, tightly scoped, well-tested, readable patch that fits the project's existing style and would survive serious code review?
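To make the blocking-versus-weighted scoring idea concrete, here is a minimal sketch in Python. It is not Cognition's actual scoring code: the names `WeightedItem` and `score_patch`, the rule that any triggered blocker zeroes the score, and the weighted-average formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeightedItem:
    """One weighted rubric item (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def score_patch(blockers_triggered: list[str], weighted: list[WeightedItem]) -> float:
    """Return a score in [0, 1] for a single patch.

    Assumption: any triggered blocking item (e.g. broken behavior,
    missing logic) makes the patch unmergeable, so the score is 0.
    Otherwise the score is the weight-normalized share of passed items.
    """
    if blockers_triggered:
        return 0.0
    total = sum(item.weight for item in weighted)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in weighted if item.passed)
    return earned / total


# Example rubric with made-up weights.
rubric = [
    WeightedItem("readability", 2.0, True),
    WeightedItem("type safety", 1.0, False),
]
clean_score = score_patch([], rubric)                     # no blockers triggered
blocked_score = score_patch(["broken behavior"], rubric)  # one blocker triggered
```

Under these assumptions, a patch that reads well but breaks existing behavior scores zero, which matches the post's point that passing tests alone is not enough to be mergeable.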