swyx@swyx

2026-06-05 03:03 · 28 days ago

AI summary

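Cognition has released an enterprise-grade AI coding eval that supports in-depth testing of up to 100 hours (METR tops out at about 16 hours) and comes with a financial guarantee: if Devin delivers less value than it costs, Cognition will make up the difference, up to USD 10 million. The METR dataset covers ML engineering, GPU kernels, and cybersecurity, and uses GPT-4o and GPT-5 to estimate human time from Claude Code transcripts, with rlog = 0.83. The Cognition dataset comes from 258 real sessions from 126 Devin users (Java/TS/Python/C# feature development, bug fixes, migrations), with rlog = 0.74 on the held-out set.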

Finally! the first eval ship from cog!!!!!!!!!! 👼🏼

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯

METR dataset: ML eng, GPU kernels, cybersecurity

"METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". rlog of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

"We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." rlog of 0.74 on held out set

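For readers unfamiliar with the "rlog" figures quoted above, here is a minimal sketch of one plausible reading: the Pearson correlation between the logarithm of model-estimated times and the logarithm of human-reported times. That interpretation is an assumption, not something either quoted source spells out here, and the function names and sample numbers below are made up purely for illustration.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_log(estimated_hours, ground_truth_hours):
    """ASSUMPTION: 'rlog' = Pearson r computed on log-transformed durations."""
    log_est = [math.log(h) for h in estimated_hours]
    log_true = [math.log(h) for h in ground_truth_hours]
    return pearson(log_est, log_true)


# Hypothetical numbers, not taken from either dataset.
estimated = [0.5, 2.0, 8.0, 30.0, 90.0]   # model-estimated hours per session
reported = [1.0, 1.5, 12.0, 20.0, 100.0]  # human-reported hours per session
print(round(r_log(estimated, reported), 2))
```

Working in log space means a 2x miss on a 1-hour task counts the same as a 2x miss on a 50-hour task, which is one reason such a metric would suit durations spanning minutes to ~100 hours.
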
this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!

Cognition: AI should earn its keep. Introducing the AI Productivity Guarantee. If Devin delivers less engineering value than you're paying for, Cognition will fund your us...

swyx@swyx · X
