New benchmark shows Claude Mythos and GPT-5.5 can autonomously develop real browser exploits

2026-05-16 21:08 · Matthias Bastian

AI summary

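Researchers at Carnegie Mellon University built a new benchmark that measures how well AI agents can exploit real vulnerabilities in Google's V8 engine. In the tests, Claude Mythos far outperformed GPT-5.5, but at roughly twelve times the cost. The benchmark shows that today's most advanced AI models can already develop working browser exploits autonomously, which underscores AI's dual potential and risk for both attack and defense in cybersecurity.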

Original article

New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in Google's JavaScript engine V8. Mythos leads GPT-5.5 by a wide margin, but it costs a fortune.

Unlike previous tests, the benchmark doesn't just check whether a bug gets triggered. It scores progress across five tiers, all the way up to arbitrary code execution, meaning an attacker can run any commands they choose on the target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.

Anthropic's Claude Mythos Preview, with occasional human hints ("nudges"), hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two.

The gap gets even wider in fully autonomous mode. Mythos scored 9.55 points there, barely any drop. GPT-5.5 via Codex managed only 4.30. None of the other tested models achieved full code execution (T1, the top tier).
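To put the reported scores side by side, here is a minimal Python sketch that computes the gap between the two models in each mode and how much each model loses without human hints. It uses only the averages quoted above; it assumes the 5.51 figure for GPT-5.5 refers to the same hinted setting as the 9.90 for Mythos, and the variable and function names are illustrative.

```python
# Average scores (out of 16) as quoted in the article.
# Assumption: GPT-5.5's 5.51 was measured in the same "nudged" setting as Mythos's 9.90.
scores = {
    "Claude Mythos Preview": {"nudged": 9.90, "autonomous": 9.55},
    "GPT-5.5": {"nudged": 5.51, "autonomous": 4.30},
}


def relative_drop(nudged: float, autonomous: float) -> float:
    """Fraction of the nudged score that is lost when human hints are removed."""
    return (nudged - autonomous) / nudged


mythos = scores["Claude Mythos Preview"]
gpt = scores["GPT-5.5"]

# Point gap between the two models in each mode.
gap_nudged = mythos["nudged"] - gpt["nudged"]
gap_autonomous = mythos["autonomous"] - gpt["autonomous"]

# Percentage of score lost when going from nudged to fully autonomous.
mythos_drop_pct = relative_drop(**mythos) * 100
gpt_drop_pct = relative_drop(**gpt) * 100

print(f"Gap with nudges:    {gap_nudged:.2f} points")      # 4.39
print(f"Gap autonomous:     {gap_autonomous:.2f} points")  # 5.25
print(f"Mythos score drop:  {mythos_drop_pct:.1f}%")       # 3.5%
print(f"GPT-5.5 score drop: {gpt_drop_pct:.1f}%")          # 22.0%
```

On these numbers, Mythos loses about 3.5 percent of its score without hints while GPT-5.5 loses about 22 percent, which is why the gap grows from 4.39 to 5.25 points.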

The price tags differ sharply: the full Mythos test run across 122 episodes cost about $36,428, according to ExploitBench. GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about one-twelfth of that. In a recent test, the UK's AI Security Institute also found that Mythos performs somewhat better than GPT-5.5 but at a much higher cost. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem.
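The cost comparison can be checked with a few lines of Python. This is only a sketch based on the approximate totals and episode counts quoted above; the dictionary layout and variable names are illustrative, and the per-episode figures are derived here rather than reported by ExploitBench.

```python
# Approximate run totals as quoted in the article (the article gives them as "about" / "roughly").
runs = {
    "Claude Mythos Preview": {"cost_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"cost_usd": 3_075, "episodes": 123},
}

# Derived value, not reported in the article: average cost per episode.
per_episode = {
    name: run["cost_usd"] / run["episodes"] for name, run in runs.items()
}

# How many times more the full Mythos run cost than the full GPT-5.5 run.
cost_ratio = (
    runs["Claude Mythos Preview"]["cost_usd"] / runs["GPT-5.5 via Codex"]["cost_usd"]
)

for name, cost in per_episode.items():
    print(f"{name}: ${cost:.2f} per episode")
print(f"Total cost ratio: {cost_ratio:.1f}x")  # about 11.8x
```

The total works out to roughly 11.8 times, consistent with the "about twelve times" figure, or about $299 versus $25 per episode.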

Mythos works like a "fairly competent" browser security researcher

ExploitBench co-author Seunghyun Lee—himself an experienced security researcher with over 20 reported browser vulnerabilities—reviewed the Mythos transcripts one by one. His takeaway: the model works like a "fairly competent browser / JS engine security researcher."

In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year, according to Lee.

Source: The Decoder, AI News (RSS), the-decoder.com
