# White House and Anthropic Collaborate on an AI Model Jailbreak Assessment Framework

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-19 20:29
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmqkxhy4v006aslu7mx1dkk7v
- Original link: https://x.com/rohanpaul_ai/status/2067947789578125391

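## AI Summary

The White House and Anthropic are collaborating on a formal technical assessment framework to quantify the severity of jailbreak attacks on AI models and to establish a standardized evaluation methodology. The framework would develop common benchmarks that assess how far safeguards were bypassed, which capabilities were exposed, how repeatable the attack is, and what the real operational consequences are. Both sides recognize that complete immunity to jailbreaks is not a feasible goal. Recent red-team research shows that the hardened frontier model Fable 5 is more robust than Opus 4.8 but still produces harmful outputs under sustained automated attack. The new framework measures risk by asking the same questions every time and is seen as a more pragmatic regulatory path.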

## Full Text

The White House and Anthropic may have found the first serious path to restore Mythos and Fable access without pretending jailbreaks can be eliminated.

AI regulation may be shifting from vague fear to benchmark-based tests of model failure, because eliminating every possible jailbreak is probably not an achievable target.

The proposed framework would score how far the bypass went, what capabilities became reachable, how repeatable the attack was, and whether the exposed behavior had real operational consequences.
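
As a rough illustration of what scoring along those four dimensions could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `JailbreakAssessment` name, the field names, the 0-to-1 scale, and the unweighted average are hypothetical, and nothing has been published about how the actual framework would combine its measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JailbreakAssessment:
    """Hypothetical record with one field per dimension named in the post.

    Each value is assumed to be normalized to the range [0, 1].
    """

    bypass_depth: float         # how far the bypass went
    capability_exposure: float  # what capabilities became reachable
    repeatability: float        # how repeatable the attack was
    operational_impact: float   # real operational consequences

    def __post_init__(self) -> None:
        # Reject values outside the assumed 0-to-1 scale.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def severity(self) -> float:
        """Unweighted mean of the four dimensions (an illustrative choice)."""
        total = (
            self.bypass_depth
            + self.capability_exposure
            + self.repeatability
            + self.operational_impact
        )
        return total / 4


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real evaluation.
    example = JailbreakAssessment(0.5, 0.25, 0.75, 0.5)
    print(f"severity = {example.severity():.2f}")
```

A fixed record like this reflects the post's core point: every model gets measured against the same four questions, whatever weighting a real framework eventually settles on.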

Both sides are now moving toward a shared way to score what a jailbreak actually means.

The hard truth is that perfect immunity is probably the wrong target. A recent red-team study found that even hardened frontier models still produced confirmed harmful completions under sustained automated attack, with Fable 5 remaining more robust than Opus 4.8 but not invulnerable.

So for each newly released model, governments can ask the same questions every time: how much was bypassed, what capability was exposed, how reproducible was the attack, and what damage could follow. That is a much more practical path.

### Quoted Tweet

> Sophia Cai: NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and cre...
