Rohan Paul@rohanpaul_ai

2026-06-19 21:09 · 13 days ago

AI Summary

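A new study runs automated red-team attacks against Anthropic's Fable 5 and Opus 4.8, repeatedly rewriting harmful prompts until the model refuses or produces a bad answer. Fable 5's worst attack success rate was 6.1% and Opus 4.8's was 11.5%, showing that even the strongest LLMs cannot be fully immune to jailbreaks: even with a tiny failure rate, automated attacks at scale can still produce a large volume of harmful content. Old-style encoding and role-play jailbreaks are no longer the main threat; the new weakness is contextual, as adaptive attackers keep rewriting a request after each refusal, looking for a frame the model treats as legitimate rather than dangerous. The White House and Anthropic are moving toward benchmark-based testing frameworks that quantify jailbreak risk by scoring the degree of bypass, the capabilities exposed, the repeatability of the attack, and the real-world consequences, rather than chasing an unrealistic perfect immunity.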

Perfect immunity from jailbreak is not possible even for the strongest of LLMs.

A new study shows that frontier models are getting harder, but not impossible, to jailbreak.

The study attacks Anthropic's Fable 5 and Opus 4.8 with automated red-team tools that keep rewriting harmful prompts until the model either refuses or gives a bad answer.

Fable 5 was more robust than Opus 4.8, with its worst attack success rate at 6.1%, while Opus 4.8 reached 11.5% under the strongest attack.

The hard truth is that avoiding absolutely every jailbreak is practically impossible, because even a tiny failure rate can produce many harmful completions when attacks are automated and repeated at scale.
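To make the scale argument concrete, here is a rough back-of-the-envelope sketch in Python. It uses the two worst-case success rates quoted above. Everything else is an assumption for illustration: the attempt counts are hypothetical, and each attempt is treated as an independent trial with a fixed success probability, which is a simplification and not necessarily how the study defines attack success rate.

```python
def expected_successes(rate: float, attempts: int) -> float:
    """Expected number of successful attacks, ASSUMING independent attempts."""
    return rate * attempts


def prob_at_least_one(rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - rate) ** attempts


# Worst-case attack success rates quoted in the post.
RATES = {"Fable 5": 0.061, "Opus 4.8": 0.115}

if __name__ == "__main__":
    for model, rate in RATES.items():
        # Hypothetical attack volumes, chosen only to show the trend.
        for attempts in (10, 100, 10_000):
            print(
                f"{model}: {attempts} attempts -> "
                f"~{expected_successes(rate, attempts):.0f} expected successes, "
                f"P(at least one) = {prob_at_least_one(rate, attempts):.3f}"
            )
```

Under these assumptions the expected count grows linearly with the number of attempts, and the chance of at least one success approaches certainty quickly, which is the sense in which a small per-attempt failure rate still matters once attacks are automated.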

The most crucial point is that the old cartoon version of jailbreaks, with weird encodings and theatrical role-play, is no longer the main problem.

The surviving weakness is contextual, because adaptive attackers rewrite the request after refusals, searching for a frame the model treats as legitimate rather than dangerous.

That is why perfect immunity is probably the wrong target: language models do not inspect intent from a clean moral altitude; they infer meaning through phrasing, context, and precedent.

In any system this flexible, there will always be boundary cases where a harmful request looks enough like education, safety research, fiction, troubleshooting, or policy analysis to slip through.

----

Link - arxiv.org/abs/2606.18193

Title: "A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models"

Rohan Paul@rohanpaul_ai · X
