# New Study: Even the Strongest LLMs Are Not Fully Immune to Jailbreaks (Automated Red-Team Analysis of Fable 5 and Opus 4.8)

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-19 21:09
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmqkykk3g00hcslu7ozf7x6m5
- Original link: https://x.com/rohanpaul_ai/status/2067957841303085226

## AI Summary

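A new study ran automated red-team attacks against Anthropic's Fable 5 and Opus 4.8, repeatedly rewriting harmful prompts until the model refused or produced a bad answer. Fable 5's worst-case attack success rate was 6.1% and Opus 4.8's was 11.5%, showing that even the strongest LLMs are not fully immune to jailbreaks: even with a small failure rate, automated attacks at scale can still produce a large volume of harmful content. Old-style encoding and role-play jailbreaks are no longer the main threat. The new weakness is contextual: adaptive attackers keep rewriting a request after each refusal, looking for a framing the model treats as legitimate rather than dangerous. The White House and Anthropic are moving toward a benchmark-based testing framework that quantifies jailbreak risk by scoring the degree of bypass, the capability exposed, the repeatability of the attack, and the real-world consequences, rather than pursuing unrealistic perfect immunity.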

## Full Text

Perfect immunity from jailbreaks is not possible, even for the strongest LLMs.

A new study shows that frontier models are getting harder to jailbreak, but not impossible.

The study attacks Anthropic's Fable 5 and Opus 4.8 with automated red-team tools that keep rewriting harmful prompts until the model either refuses or gives a bad answer.

Fable 5 was more robust than Opus 4.8, with its worst attack success rate at 6.1%, while Opus 4.8 reached 11.5% under the strongest attack.

The hard truth is that avoiding absolutely every jailbreak is practically impossible, because even a tiny failure rate can produce many harmful completions when attacks are automated and repeated at scale.
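To make the scale argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the study: it assumes each attack attempt is an independent trial whose success probability equals the reported worst-case attack success rate, which is a simplification (real attempts are correlated, and the study may define its rates differently). The function names and the attempt counts are illustrative.

```python
def expected_harmful(rate: float, attempts: int) -> float:
    """Expected number of successful attacks in `attempts` tries.

    Assumption: every try succeeds independently with probability `rate`.
    """
    return rate * attempts


def prob_at_least_one(rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - rate) ** attempts


# Worst-case attack success rates quoted in the post.
RATES = {"Fable 5": 0.061, "Opus 4.8": 0.115}

# Hypothetical attempt counts, chosen only to show the effect of scale.
ATTEMPT_COUNTS = (10, 1_000, 100_000)

if __name__ == "__main__":
    for model, rate in RATES.items():
        for attempts in ATTEMPT_COUNTS:
            print(
                f"{model}: {attempts:>7} attempts -> "
                f"expected successes {expected_harmful(rate, attempts):,.0f}, "
                f"P(at least one) {prob_at_least_one(rate, attempts):.3f}"
            )
```

Under these assumptions, the chance of at least one success approaches certainty well before the attempt count gets large, and the expected number of harmful completions grows linearly with the number of attempts. That is the sense in which a single-digit failure rate still matters when attacks are automated.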

The most crucial point is that the old cartoon version of jailbreaks, weird encodings and theatrical role-play, is no longer the main problem.

The surviving weakness is contextual, because adaptive attackers rewrite the request after refusals, searching for a frame the model treats as legitimate rather than dangerous.

That is why perfect immunity is probably the wrong target: language models do not inspect intent from a clean moral altitude; they infer meaning through phrasing, context, and precedent.

In any system this flexible, there will always be boundary cases where a harmful request looks enough like education, safety research, fiction, troubleshooting, or policy analysis to slip through.

----

Link: arxiv.org/abs/2606.18193

Title: "A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models"

### Quoted Tweet

> Rohan Paul: The White House and Anthropic may have found the first serious path to restore Mythos and Fable access without pretending jailbreaks can be eliminated. AI regul...
