新基准测试显示 Claude Mythos 与 GPT-5.5 可自主开发真实浏览器漏洞利用程序
阅读原文· the-decoder.com卡内基梅隆大学的研究人员构建了一项新基准,用于衡量AI代理在利用谷歌V8引擎真实漏洞方面的能力。测试显示,Claude Mythos 的表现大幅领先 GPT-5.5,但其使用成本高达后者的十二倍。该基准表明,当前先进的AI模型已能自主开发有效的浏览器漏洞利用程序,这凸显了AI在网络安全领域兼具攻防双重潜力与风险。
New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously
Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in Google's JavaScript engine V8. Mythos leads GPT-5.5 by a wide margin, but it costs a fortune.
Unlike previous tests, the benchmark doesn't just check whether a bug gets triggered. It scores progress across five tiers, all the way up to arbitrary code execution, running whatever commands you want on the target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.
Anthropic's Claude Mythos Preview, with occasional human hints ("nudges"), hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two.