# New benchmark shows Claude Mythos and GPT-5.5 can autonomously develop real browser exploits

- Source: The Decoder: AI News (RSS)
- Author: Matthias Bastian
- Published: 2026-05-16 21:08
- AIHOT score: 45
- AIHOT link: https://aihot.virxact.com/items/cmp8dpsn60hrwslnzh7oud7ym
- Original link: https://the-decoder.com/new-benchmark-shows-claude-mythos-and-gpt-5-5-can-develop-real-browser-exploits-autonomously

## AI Summary

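Researchers at Carnegie Mellon University have built a new benchmark that measures how well AI agents can exploit real vulnerabilities in Google's V8 engine. In the tests, Claude Mythos far outperformed GPT-5.5, but at roughly twelve times the cost. The benchmark indicates that today's advanced AI models can already develop working browser exploits autonomously, which underscores both the offensive and defensive potential of AI in cybersecurity and the risks that come with it.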

## Full Text

New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in Google's JavaScript engine V8. Mythos leads GPT-5.5 by a wide margin, but it costs a fortune.

Unlike previous tests, the benchmark doesn't just check whether a bug gets triggered. It scores progress across five tiers, all the way up to arbitrary code execution, meaning the attacker can run any command on the target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.

Anthropic's Claude Mythos Preview, with occasional human hints ("nudges"), hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two.

The gap gets even wider in fully autonomous mode. Mythos scored 9.55 points there, barely any drop. GPT-5.5 via Codex managed only 4.30. None of the other tested models reached the top tier, full code execution (T1).
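To make the comparison concrete, the short Python sketch below recomputes the gap between the two models and each model's relative drop when the hints are removed. It uses only the four average scores quoted above. It assumes that GPT-5.5's 5.51 was measured in the same nudged setting as Mythos's 9.90, which the article implies but does not state outright. The dictionary layout and function names are illustrative and are not taken from the benchmark's code.

```python
# Average scores (out of 16) as quoted in the article.
# Assumption: GPT-5.5's 5.51 comes from the same "nudged" setting as Mythos's 9.90.
SCORES = {
    "Claude Mythos Preview": {"nudged": 9.90, "autonomous": 9.55},
    "GPT-5.5": {"nudged": 5.51, "autonomous": 4.30},
}


def relative_drop(nudged: float, autonomous: float) -> float:
    """Fraction of the nudged score lost when running fully autonomously."""
    return (nudged - autonomous) / nudged


def gap(mode: str) -> float:
    """Point difference between Mythos and GPT-5.5 in the given mode."""
    return SCORES["Claude Mythos Preview"][mode] - SCORES["GPT-5.5"][mode]


for model, s in SCORES.items():
    drop = relative_drop(s["nudged"], s["autonomous"])
    print(f"{model}: {drop:.1%} lower without nudges")

print(f"Gap with nudges: {gap('nudged'):.2f} points")
print(f"Gap fully autonomous: {gap('autonomous'):.2f} points")
```

Under these assumptions, Mythos loses about 3.5% of its score without hints while GPT-5.5 loses about 22%, and the gap grows from 4.39 to 5.25 points.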

The price tags differ sharply. According to ExploitBench, the full Mythos test run across 122 episodes cost about $36,428, while GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about one-twelfth as much. In a recent test, the UK's AI Safety Institute also confirmed that Mythos performs somewhat better than GPT-5.5 but at a much higher cost. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem.
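The "about twelve times" figure can be checked directly from the reported totals. The sketch below is a back-of-the-envelope calculation using only the dollar amounts and episode counts quoted above. The `RUNS` structure and the helper name are hypothetical, and the per-episode averages are derived here rather than reported by ExploitBench.

```python
# Totals and episode counts as quoted in the article (approximate figures).
RUNS = {
    "Claude Mythos Preview": {"total_usd": 36428, "episodes": 122},
    "GPT-5.5 via Codex": {"total_usd": 3075, "episodes": 123},
}


def cost_per_episode(run: dict) -> float:
    """Average cost of a single episode in US dollars."""
    return run["total_usd"] / run["episodes"]


mythos = RUNS["Claude Mythos Preview"]
gpt = RUNS["GPT-5.5 via Codex"]

total_ratio = mythos["total_usd"] / gpt["total_usd"]
episode_ratio = cost_per_episode(mythos) / cost_per_episode(gpt)

print(f"Mythos per episode: ${cost_per_episode(mythos):.2f}")
print(f"GPT-5.5 per episode: ${cost_per_episode(gpt):.2f}")
print(f"Total cost ratio: {total_ratio:.2f}x")
print(f"Per-episode cost ratio: {episode_ratio:.2f}x")
```

This works out to roughly $298.59 per Mythos episode against $25.00 per GPT-5.5 episode, a ratio of about 11.85 on total cost and 11.94 per episode, consistent with the rounded "twelve times" in the text.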

### Mythos works like a "fairly competent" browser security researcher

ExploitBench co-author Seunghyun Lee, himself an experienced security researcher with over 20 reported browser vulnerabilities, reviewed the Mythos transcripts one by one. His takeaway: the model works like a "fairly competent browser / JS engine security researcher."

In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year, according to Lee.

The researchers acknowledge that the tested bugs are publicly known, and models could theoretically draw on training data. But the dataset also includes vulnerabilities with no public exploit or bug report. The benchmark doesn't yet measure the ability to find new flaws or fully weaponize an exploit for real attacks.

The benchmark is available on GitHub, and the paper is on arXiv. Anthropic and OpenAI provided API credits; the authors say all analysis was done independently.

