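Cognition has released an enterprise-grade AI coding eval that supports deep tests of up to 100 hours (METR's cap out at about 16 hours), backed by a financial guarantee: if the value Devin delivers falls below the fee, Cognition will make up the shortfall, up to $10 million. The METR dataset covers ML engineering, GPU kernels, and cybersecurity, and used GPT-4o and GPT-5 to estimate human time from Claude Code transcripts (rlog = 0.83). The Cognition dataset comes from 258 real sessions across 126 Devin users (feature development, bug fixes, and migrations in Java/TS/Python/C#), with rlog = 0.74 on the held-out set.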
Finally! the first eval ship from cog!!!!!!!!!! 👼🏼
To contextualize: @METR_Evals cap out at ~16 hours.
Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯
METR dataset: ML eng, GPU kernels, cybersecurity
"METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". rlog of 0.83
Cog dataset: real-life Java/TypeScript/Python/C# feature dev, bugfixes, migrations
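
For anyone wondering what the rlog numbers measure: the post doesn't define it, so the sketch below assumes it means the Pearson correlation between the log of the estimated human time and the log of the ground-truth human time. The function name `r_log` and the sample hours are made up for illustration and are not from either dataset.

```python
import math


def r_log(estimated_hours, actual_hours):
    """Pearson correlation computed on log-transformed durations.

    Assumption: this is one plausible reading of "rlog"; the source
    does not spell out the exact definition.
    """
    if len(estimated_hours) != len(actual_hours) or len(estimated_hours) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    # Work in log space so a 2x miss on a 1-hour task counts the same
    # as a 2x miss on a 50-hour task.
    xs = [math.log(v) for v in estimated_hours]
    ys = [math.log(v) for v in actual_hours]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")

    return cov / math.sqrt(var_x * var_y)


# Hypothetical sessions: model-estimated hours vs. human-labeled hours.
estimated = [0.5, 2.0, 8.0, 30.0, 90.0]
labeled = [0.7, 1.5, 12.0, 25.0, 100.0]
score = r_log(estimated, labeled)
```

Note that a correlation in log space only says the estimates rank and scale tasks consistently; a model that is off by a constant factor on every task would still score 1.0.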